This R Jupyter notebook starts with the annotated cell-gene matrix that is produced by align_and_annotate.ipynb and then further agumented with the PacBio information from pacbio_analysis.ipynb.
It performs some analyses using Monocle as well as many others just using R.
It analyzes IFN-enriched A549 cells infected with influenza (the enrichment was done by MACS sorting at 13 hours post-infection). The analysis also looks at the canine (MDCK) cell spike-ins used to estimate background amounts of flu in non-infected cells.
The main goal is to identify viral features associated with IFN induction
Analysis by Alistair Russell and Jesse Bloom.
options(warn=-1) # suppress warnings that clutter output
# install R packages
r_packages <- c(
"ggplot2", "ggExtra", "gridExtra", "cowplot", "xtable",
"scales", "reshape2", "dplyr", "magrittr", "rmarkdown",
"IRdisplay", "psych", "qlcMatrix", "colorRamps", "ggpubr",
"tidyverse", "RColorBrewer", "naturalsort", "grid",
"DescTools", "stringr", "ggrepel", "assertr", "gggenes",
"ggsignif")
suppressMessages(invisible(
lapply(r_packages, library, character.only=TRUE)))
# install Bioconductor packages
bioc_packages <- c("monocle", "piano", "qvalue", "Biostrings",
"genbankr")
suppressMessages(invisible(
lapply(bioc_packages, library, character.only=TRUE)))
sessionInfo()
Define some variables and functions that are used throughout the rest of the notebook.
# http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/#a-colorblind-friendly-palette
# The palette with grey:
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442",
"#0072B2", "#D55E00", "#CC79A7")
# The palette with black
cbbPalette <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442",
"#0072B2", "#D55E00", "#CC79A7")
# plots will be saved here
plotsdir <- './results/plots/'
if (!dir.exists(plotsdir))
dir.create(plotsdir)
# figures for paper will be saved here
figsdir <- './paper/figures/single_cell_figures'
if (!dir.exists(figsdir))
dir.create(figsdir)
saveShowPlot <- function(p, width, height, isfig=FALSE) {
# save plot with filename of variable name with dots replaced by _, then show
# if *isfig* is TRUE, then also saves a PDF to *figsdir*
pngfile <- file.path(plotsdir, sprintf("%s.png",
gsub("\\.", "_", deparse(substitute(p)))))
figfile <- file.path(figsdir, sprintf("%s.pdf",
gsub("\\.", "_", deparse(substitute(p)))))
ggsave(pngfile, plot=p, width=width, height=height, units="in")
if (isfig)
ggsave(figfile, plot=p, width=width, height=height, units="in")
display_png(file=pngfile, width=width * 90)
}
fancy_scientific <- function(x, parse.str=TRUE, digits=NULL) {
# scientific notation formatting, based loosely on https://stackoverflow.com/a/24241954
# if `parse.str` is TRUE, then we parse the string into an expression
# `digits` indicates how many digits to include
x %>% format(scientific=TRUE, digits=digits) %>% gsub("^0e\\+00","0", .) %>%
gsub("^1e\\+00", "1", .) %>% gsub("^(.*)e", "'\\1'e", .) %>%
gsub("e\\+","e", .) %>% gsub("e", "%*%10^", .) %>%
gsub("^\'1\'\\%\\*\\%", "", .) %>% {if (parse.str) parse(text=.) else .}
}
We have two cell types in the experiments, and there is a separate cell-gene matrix for each cell type. The two cell types are:
For most analyses, the humanplusflu cell types is the one of interest, the canine cells are the "other" cell type used as a control. We therefore specify variables giving the celltype of interest and the other cell type.
We also specify which cell type has PacBio annotations (which are processed when reading the cell-gene matrix). This is only the humanplusflu cells.
celltype_interest <- "humanplusflu"
celltype_other <- "canine"
celltypes <- c(celltype_interest, celltype_other)
celltype_pacbio_annotated <- setNames(c(TRUE, FALSE), celltypes)
# prettier cell-type names for plots
celltypes_pretty <- setNames(
lapply(celltypes, function (c) gsub("plus", " + ", c)),
celltypes) %>% unlist
We load the cell-gene matrices that were created previously by align_and_annotate.ipynb, with the cells further annotated with mutations identified by pacbio_analysis.ipynb. These matrices include all cells identified by cellranger.
Note that the cells are annotated by how many flu reads contain the synonymous viral barcoces near the 3' ends by align_and_annotate.ipynb after the cellranger analysis. The cell type of interest is then further annotated with mutations identified by the PacBio analysis in pacbio_analysis.ipynb.
# Cell gene matrices in this directory
matrixdir <- "./results/cellgenecounts/"
# Read cell-gene matrices into a vector named by cell type.
# These are in `matrixdir` with names like "merged_humanplusflu_matrix.mtx",
# and for PacBio annotated cells "PacBio_annotated_merged_humanplusflu_cells.tsv"
all_cells <- lapply(
setNames(celltypes, celltypes),
function(celltype) {
if (celltype_pacbio_annotated[[celltype]]) {
cellsfile <- paste("PacBio_annotated_merged", celltype, "cells.tsv", sep="_")
} else {
cellsfile <- paste("merged", celltype, "cells.tsv", sep="_")
}
newCellDataSet(
readMM(file.path(matrixdir,
paste("merged", celltype, "matrix.mtx", sep="_"))),
phenoData=new(
"AnnotatedDataFrame", data=read.delim(file.path(matrixdir, cellsfile),
na.strings=c("None"))), # entries of None indicate NA
featureData=new(
"AnnotatedDataFrame",
data=read.delim(file.path(matrixdir,
paste("merged", celltype, "genes.tsv", sep="_")))
),
expressionFamily=negbinomial.size()
)
}
)
We determine the number of cells of each type as called by the 10X pipeline in align_and_annotate.ipynb. We separate the cells by whether they are "pure" of one type, or whether they are a cross-celltype multiplet (mix of humanplusflu and canine).
We first look for cell barcodes that are called as both humanplusflu and canine cells. There are cross-species multiplets, and should be filtered from downstream analyses. Their identification also provides an estimate of the rate of multiplets.
We annotate whether a cell is human / canine by adding a new pData attribute called cross_celltype_multiplet.
# Annotate cross-celltype multiplets
for (celltype in celltypes) {
# get all barcodes for the other cell type
other_cellbarcodes <- lapply(
celltypes,
function (other_celltype) {
if (celltype != other_celltype) {
pData(all_cells[[other_celltype]])$CellBarcode %>% as.character
}
}
) %>%
unlist
# mark shared barcodes as multiplets
pData(all_cells[[celltype]])$cross_celltype_multiplet <- all_cells[[celltype]] %>%
pData %$%
CellBarcode %in% other_cellbarcodes
}
Tabulate the number of cells that are pure of each type, plus the cross-celltype multiplets. The table below shows that most of the cells are humanplusflu. This is expected, since the canine cells were spiked in at a relatively low fraction (targeting about 5 to 10%). There are a modest number of cross-celltype multiplets.
tab_cellcounts <- lapply(
celltypes,
function (celltype) {
all_cells[[celltype]] %>%
pData %>%
mutate(celltype=celltype) %>%
group_by(celltype, cross_celltype_multiplet) %>%
summarize(ncells=n()) %>%
ungroup
}) %>%
bind_rows %>%
mutate(celltype=ifelse(cross_celltype_multiplet, "cell-type mix", celltype),
celltype_pretty=ifelse(cross_celltype_multiplet, "cell-type mix",
celltypes_pretty[celltype])) %>%
select(-cross_celltype_multiplet) %>%
unique %>%
transform(celltype_pretty=factor(celltype_pretty,
c(celltypes_pretty, "cell-type mix")))
tab_cellcounts
We estimate the multiplet frequency from the number of canine cells, humanplusflu cells, and cross-celltype multiplets using the function described in Bloom (2018, DOI 10.1101/293639). This calculation estimates the multiplet frequency at $\approx$10%.
# Function to compute multiplet frequency from:
# https://github.com/jbloomlab/multiplet_freq/blob/master/calcmultiplet_R.ipynb
# Note that for this function, n1 and n2 are droplets with any cells from that
# host (including multiplets).
multiplet_freq <- function(n1, n2, n12) {
n <- n1 * n2 / n12
mu1 <- -log((n - n1) / n)
mu2 <- -log((n - n2) / n)
mu <- mu1 + mu2
return (1 - mu * exp(-mu) / (1 - exp(-mu)))
}
tab_multiplet <- tab_cellcounts %>%
mutate(celltype=str_replace(celltype, "cell-type ", "celltype_")) %>%
select(-celltype_pretty) %>%
spread(celltype, ncells) %>%
mutate(n1=get(celltype_interest) + celltype_mix,
n2=get(celltype_other) + celltype_mix,
n12=celltype_mix,
multiplet_freq=multiplet_freq(n1, n2, n12))
tab_multiplet
Finally, we make a plot that shows number of pure cells of each type and cross-celltype multiplets, and is annnotated with the estimated multiplet frequency:
p_cellcounts <- ggplot(
tab_cellcounts,
aes(x=celltype_pretty, y=ncells)) +
geom_bar(width=0.75, position="dodge", stat="identity") +
geom_text(aes(label=ncells), hjust=-0.1) +
scale_y_continuous(name="number of cells",
expand=expand_scale(c(0.02, 0.23))) +
scale_x_discrete(name=NULL) +
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.text=element_text(size=14)) +
coord_flip() +
annotate("text", x=2.5, y=400,
label=sprintf("multiplet rate = %.0f%%",
100 * tab_multiplet$multiplet_freq),
hjust=0,
fontface="italic")
saveShowPlot(p_cellcounts, width=3.5, height=0.5 * (1 + length(celltypes)))
In the previous section, we estimated the multiplet frequency.
However, we can also try to filter the multiplets and other low-quality cells. We do this in two ways:
First, we remove all of the cross-celltype multiplets from the dataset:
all_cells <- lapply(
all_cells,
function (c) {
c[, ! pData(c)$cross_celltype_multiplet]
}
)
Annotate cells with the number of total mRNAs, cellular mRNAs, total flu mRNAs, flu mRNAs for each gene, and fraction of total mRNA from flu. Note that for the canine cells, the flu genes are not in the reference genome, as the canine cells weren't infected. Nonetheless, they will have a small fraction of flu reads due to leakage. We get their number of flu reads from the pre-annotation of the cell-gene matrix performed by align_and_annotate.ipynb. For the humanplusflu cells, we calculate the flu reads from the cell-gene matrix.
# names of the flu genes
# use "flu" prefix so "NA" isn't interpreted as not any
flugenes <- c("fluPB2", "fluPB1", "fluPA", "fluHA",
"fluNP", "fluNA", "fluM", "fluNS")
flugenes_noprefix <- sapply(flugenes, function (x) gsub("flu", "", x)) %>% as.vector
# annotate by number of total, cellular, and flu mRNAs, and frac from flu
all_cells <- sapply(
names(all_cells),
function (name) {
c <- all_cells[[name]]
pData(c)$total_mRNAs <- Matrix::colSums(exprs(c))
pData(c)$cell_mRNAs <- Matrix::colSums(exprs(c[
row.names(subset(fData(c), !(gene_short_name %in% flugenes))),]))
flu_indices <- subset(fData(c), (gene_short_name %in% flugenes))
if (name == celltype_interest) {
# flu genes in cell-gene matrix, so calculate flu mRNAs from that
stopifnot(nrow(flu_indices) == length(flugenes))
pData(c)$flu_mRNAs <- Matrix::colSums(exprs(c[
row.names(flu_indices),]))
for (g in flugenes) {
pData(c)[[paste0(g, "_mRNAs")]] = Matrix::colSums(
exprs(c[row.names(subset(fData(c), gene_short_name == g)), ]))
}
} else {
# use flu count set during custom annotation of cell-gene matrix
stopifnot(name == celltype_other)
pData(c)$flu_mRNAs <- pData(c)$total_annotated_flu
for (g in flugenes) {
pData(c)[[paste0(g, "_mRNAs")]] = pData(c)[[
paste0("annotated_", g)]]
}
}
pData(c)$frac_mRNA_from_flu <- pData(c)$flu_mRNAs / pData(c)$total_mRNAs
return(c)
},
simplify=FALSE,
USE.NAMES=TRUE
)
Below we calculate the fraction of all mRNA that is from flu:
lapply(
all_cells,
function (c) {
pData(c) %>%
summarize(n_flu_mRNAs=sum(flu_mRNAs),
n_total_mRNAs=sum(total_mRNAs),
total_frac_flu=n_flu_mRNAs / n_total_mRNAs)
}) %>%
bind_rows(.id="cell_type")
Now we compute the median number of cellular mRNAs and exclude cells with substantially more or less than this median by setting upper and lower bounds. We set a filtered flag for all cells outside these bounds, as well as for any cells that are cross-celltype multiplets:
bottom_bound = 2 # exclude if this many fold less than median
top_bound = 2 # exclude if this many fold more than median
# annotate cells with bounds
all_cells <- lapply(
all_cells,
function (c) {
pData(c)$median_cell_mRNAs <- c %>%
pData %>%
mutate(median=median(cell_mRNAs)) %>%
select(median) %>%
unlist
pData(c)$lower_bound_cell_mRNAs <- pData(c)$median_cell_mRNAs / bottom_bound
pData(c)$upper_bound_cell_mRNAs <- pData(c)$median_cell_mRNAs * top_bound
pData(c)$filtered <- (
pData(c)$cell_mRNAs < pData(c)$lower_bound_cell_mRNAs |
pData(c)$cell_mRNAs > pData(c)$upper_bound_cell_mRNAs)
return(c)
}
)
# print the bounds and number of filtered cells
lapply(
all_cells,
function (c) {
pData(c) %>%
summarize(median_cell_mRNAs=mean(median_cell_mRNAs),
lower_bound_cell_mRNAs=mean(lower_bound_cell_mRNAs),
upper_bound_cell_mRNAs=mean(upper_bound_cell_mRNAs),
n_retained=sum(! filtered),
n_filtered=sum(filtered))
}
) %>%
bind_rows(.id="cell_type")
Plot the number of cellular and viral mRNAs and show the filtering. Each point is a cell, the blue lines show the density over number of cellular mRNAs per cell, the green dotted lines show the lower and upper bounds for filtering, and the orange rug plots show the distribution of the number of flu mRNAs per cell. This plot does not show the cross-celltype multiplets, which have already been filtered above.
# combine data for cell types to plot
mRNA_counts_data <- all_cells %>%
lapply(function (c) {pData(c)}) %>%
bind_rows(.id="celltype") %>%
mutate(celltype_pretty=celltypes_pretty[celltype]) %>%
transform(celltype_pretty=factor(celltype_pretty, celltypes_pretty))
# create scatter plot
scatterplot <- ggplot(mRNA_counts_data,
aes(cell_mRNAs, flu_mRNAs), color=cbbPalette[[1]]) +
geom_point(alpha=0.3) +
geom_rug(sides='l', alpha=0.15, color=cbbPalette[[3]], size=0.8)
# density plot, get value of maxy as here: https://stackoverflow.com/a/10659563
densityplot <- eval(substitute(
{stat_density(aes(x=cell_mRNAs, y=maxy*(..scaled..)),
geom="line", color=cbbPalette[[4]])},
list(maxy=layer_scales(scatterplot)$y$range$range[2])))
# combine into faceted plot
p_flu_vs_cell <- scatterplot + densityplot +
theme(axis.text.x=element_text(angle=90, vjust=0.5, hjust=1)) +
geom_vline(aes(xintercept=lower_bound_cell_mRNAs),
color=cbbPalette[[8]], linetype='dashed') +
geom_vline(aes(xintercept=upper_bound_cell_mRNAs),
color=cbbPalette[[8]], linetype='dashed') +
facet_wrap(~ celltype_pretty, scales='free_x') +
xlab("cellular mRNAs per cell") +
ylab("flu mRNAs per cell") +
theme(strip.text=element_text(size=14), legend.position='none')
saveShowPlot(p_flu_vs_cell, width=6, height=3)
A key point in the above filtering is that we filtered based on the number of cellular mRNAs rather than total mRNAs. That approach makes sense if the flu RNAs are extra in additional to the cellular mRNAs, but less sense if they replace the cellular mRNAs so that the total mRNA of infected cells is similar to uninfected cells. In the latter case, filtering out low cellular mRNA cells would preferentially remove infected cells.
In our prior work (Russell et al (2018)), it seemed that viral mRNAs were mostly additional to cellular ones, supporting the idea of filtering on cellular mRNAs. To confirm that it is also the case here, we correlate the fraction of mRNA from flu with total and cellular mRNA. We see that for the infected cells (humanplusflu), it is much more correlated with total mRNA, with infected cells (lots of mRNA from flu) tending to have more total mRNA. Therefore, it makes more sense to filter on cellular mRNA as we have done above since it appears that highly infected cells tend to have similar amounts of cellular mRNA but more total mRNA.
# creates string of correlation (R and P-value)
cor_string <- function (x1, x2) {
cor <- cor.test(x1, x2, method='pearson')
sprintf("R = %.2f (P = %.1g)", cor$estimate, cor$p.value)
}
mRNA_counts_data %>%
group_by(celltype) %>%
summarize(fracflu_vs_total_mRNA=cor_string(
frac_mRNA_from_flu, total_mRNAs),
fracflu_vs_cellular_mRNA=cor_string(
frac_mRNA_from_flu, cell_mRNAs))
We now remove the cells that we have defined above for filtering based on having far more or less mRNA than the typical cell of that type:
all_cells <- lapply(
all_cells,
function (c) {
c[, ! pData(c)$filtered]
}
)
We want to determine which cells we think are truly infected with flu various versus just having flu mRNA they have picked up from lysed cells. We also want to call presence of each individual gene in infected cells. The canine cells provide a way to examine this, since they are not infected with flu and so should provide a baseline for how much viral mRNA to expect in non-infected cells.
We can imagine two relatively simple scenarios to explain the amount of contaminating flu mRNA that cells simply pick up from the environment:
In the first case, the number of flu mRNAs should be independent of the number of total mRNAs. In the second case, the number of flu mRNAs should increase linearly with the number of total mRNAs.
First, we make a plot of the relationship between the number of flu mRNAs and total mRNAs for the canine cells. As that plot shows, the relationship appears linear, suggesting the constant fraction model:
threshold_data <- all_cells[[celltype_other]] %>%
pData %>%
rename_all(funs(str_replace(., "annotated_", "")))
p_threshold_linear <- ggplot(
threshold_data, aes(total_mRNAs, flu_mRNAs)) +
geom_point() +
geom_smooth(method='lm')
saveShowPlot(p_threshold_linear, width=3.5, height=3)
Next, we estimate the number or fraction of mRNA in these canine cells that comes from flu.
Under the constant number of contaminating flu mRNAs model, this is just the average number of flu mRNAs per canine cell.
Under the constant fraction of contaminating flu mRNAs model, there are two ways we could compute the average fraction:
Below we see that these approaches for the constant fraction model given essentially identical results, so we just use the first approach:
threshold_data %>%
summarize(
mean_frac_flu=mean(frac_mRNA_from_flu),
weighted_mean_frac_flu=sum(flu_mRNAs) / sum(total_mRNAs),
mean_n_flu=mean(flu_mRNAs)
)
We now annotate cells by the predicted number of mRNAs from flu under both the constant number and constant fraction model. We also plot the relationship between the predicted and observed flu mRNAs in each model. The plot again suggests that the constant fraction model might be better:
threshold_data <- threshold_data %>%
mutate(mean_frac_flu=mean(frac_mRNA_from_flu),
mean_n_flu=mean(flu_mRNAs)) %>%
mutate(predicted_flu_mRNAs_constant_frac=mean_frac_flu * total_mRNAs,
predicted_flu_mRNAs_constant_number=mean_n_flu)
p_actual_predicted_flu <- ggplot(
threshold_data %>%
mutate(constant_fraction=predicted_flu_mRNAs_constant_frac,
constant_number=predicted_flu_mRNAs_constant_number) %>%
gather(key='predict_method', value='predicted_flu_mRNAs',
constant_fraction, constant_number),
aes(predicted_flu_mRNAs, flu_mRNAs)) +
geom_point() +
facet_wrap(~ predict_method)
saveShowPlot(p_actual_predicted_flu, width=5, height=3)
We now test if the predictions fit as well as expected if the data really are Poisson distributed from the predicted values under either the constant fraction or constant number model. We do this using the Poisson deviance goodness of fit test described here. The larger the deviance, the worse the fit. The P-value indicates if we can reject we can reject the Poisson distributed numbers around the predicted values.
Below, we see that both models can be rejected (neither fit the data adequately). However, going by both the deviance and the P-value, as well as the qualitative plots above, the predicted values are better under the constant fraction model.
So this provides a rationale by identifying infected humanplusflu cells based on the fraction of their mRNA from flu rather than the total number of mRNAs from flu.
threshold_data %>%
mutate(constant_fraction=predicted_flu_mRNAs_constant_frac,
constant_number=predicted_flu_mRNAs_constant_number) %>%
gather(key='predict_method', value='predicted_flu_mRNAs',
constant_fraction, constant_number) %>%
mutate(dev_term=flu_mRNAs * log(flu_mRNAs / predicted_flu_mRNAs) -
(flu_mRNAs - predicted_flu_mRNAs)) %>%
group_by(predict_method) %>%
summarize(deviance=2 * sum(dev_term),
ndegrees=n() - 1,
P=pchisq(deviance, ndegrees, lower.tail=FALSE))
The human cells were infected with a mix of wildtype and synonymously barcoded virus.
If the canine cells are just picking up environmental viral mRNA, then each should have a mix of mRNAs from these two viruses that should match the average composition of the humanplusflu cells.
Below we verify that this is true: we see that the fraction of flu reads with annotated wildtype or synonymous barcodes is about the same in canine and humanplusflu cells. Therefore, for the rest of this section, we will just use the fraction estimated from the canine cells.
Interestingly, the table below shows that the fraction of flu reads that have an annotated barcode is different between canine and humanplusflu cells, however. Not sure of the reason (it could have something to do with preferential leakage of defective particles), but this means that we will only extrapolate from canine to humanplusflu cells the fraction of annotated reads that are wildtype or synonymous, not the fraction of annotated reads.
lapply(all_cells, function (c) pData(c)) %>%
bind_rows(.id="celltype") %>%
group_by(celltype) %>%
summarize(n_wt_flu=sum(annotated_flu_wt),
n_syn_flu=sum(annotated_flu_syn),
frac_flu_wt=n_wt_flu / (n_wt_flu + n_syn_flu),
n_annotated=n_wt_flu + n_syn_flu,
frac_annotated=n_annotated / sum(flu_mRNAs))
Now we examine how well the fraction of flu reads from each barcode actually follows the predicted value assuming equal mixing. We obviously expect more statistical noise for cells with fewer reads with annotated barcodes, so we plot the fraction from each barcode as a function of the number of barcode-annotated reads. In the plots below, the horizontal line indicates the expected fraction from the averages. As the plot shows, the actual fractions are clustered around the horizontal line, with the clustering getting tighter as we increase the number of flu reads. This is as expected.
threshold_data <- threshold_data %>%
mutate(avg_frac_flu_wt=sum(flu_wt) /
(sum(flu_wt) + sum(flu_syn))) %>%
mutate(n_annotated_flu=flu_wt + flu_syn,
frac_flu_wt=flu_wt / n_annotated_flu,
predicted_flu_wt=avg_frac_flu_wt * n_annotated_flu)
p_barcode_fraction <- ggplot(
threshold_data, aes(n_annotated_flu, frac_flu_wt)) +
geom_point(alpha=0.33) +
geom_hline(aes(yintercept=avg_frac_flu_wt), color=cbbPalette[[2]])
saveShowPlot(p_barcode_fraction, width=3.5, height=3)
We now determine whether the observed "noise" in the fraction of flu with each barcode is consistent with a binomial distribution parameterized by the average fraction flu across cells. We do this using the binomial deviance goodness of fit test, calculating the deviance using the formula provided here. As can be seen below, we cannot reject the null hypothesis of a binomial distribution, suggesting we can interpret the binomial as a good fit, and assume equal mixing of barcodes in the canine cells.
threshold_data %>%
filter(n_annotated_flu > 0) %>%
mutate(dev_term1=ifelse(flu_wt == 0,
0,
flu_wt * log(flu_wt / predicted_flu_wt)),
dev_term2=ifelse(flu_syn == 0,
0,
log(flu_syn / (n_annotated_flu -
predicted_flu_wt))),
dev_term=dev_term1 + dev_term2) %>%
summarize(deviance=2 * sum(dev_term),
ndegrees=n() - 1,
P=pchisq(deviance, ndegrees, lower.tail=FALSE))
Now we look at the individual segment frequencies to come up with thresholds to call whether each cell is expressing each segment.
First, annotate all cells by the fraction of flu mRNA that comes from each segment:
all_cells <- lapply(
all_cells,
function (c) {
pData(c)[sapply(flugenes, function (x) paste0(x, "_frac"))] <- (
pData(c)[sapply(flugenes, function (x) paste0(x, "_mRNAs"))] /
pData(c)$flu_mRNAs)
c
}
)
Now let's get the total number of flu mRNAs for each gene:
flu_gene_freqs <- lapply(all_cells, function (c) pData(c)) %>%
bind_rows(.id="celltype") %>%
group_by(celltype) %>%
mutate_at(sapply(flugenes, function (x) paste0(x, "_mRNAs")),
return) %>%
summarize_at(c(flugenes, "flu_mRNAs"), sum)
flu_gene_freqs
Now we plot these data to see if the fractions from each gene are similar among the canine and humanplusflu cell types. We estimate the statistical noise on each estimate from counting statistics, and plot error bars that show 1.96 times the estimated standard deviation (95% confidence interval).
Overall, the plot below shows that humanplusflu cells look fairly similar to the canine ones in the distributions of reads among flu genes. The overall expression profiles for genes is also roughly similar to those from our previous study (Russell et al, 2018), with NS, M, NA and NP being highest, and then HA, and then the three polymerase genes.
Based on the plot below, we will use the gene frequencies in the canine cells to estimate cutoffs for whether each cell in humanplusflu is expressing a gene.
p_flu_gene_freqs <- ggplot(
flu_gene_freqs %>%
gather(gene, counts, -celltype, -flu_mRNAs) %>%
mutate(frac=counts / flu_mRNAs,
frac_sd=1.96 * sqrt(counts) / flu_mRNAs) %>%
transform(gene=factor(gsub("flu", "", gene), flugenes_noprefix)),
aes(gene, weight=frac, fill=celltype)) +
geom_bar(position=position_dodge()) +
geom_errorbar(aes(ymin=frac - frac_sd,
ymax=frac + frac_sd,
fill=celltype),
width=0.3, position=position_dodge(width=0.9)) +
ylab("fraction of flu mRNAs") +
theme(axis.text.x=element_text(angle=90,hjust=1, vjust=0.5)) +
scale_fill_manual(values=c(cbPalette[1], cbPalette[3]))
saveShowPlot(p_flu_gene_freqs, width=5, height=3)
We will now call infection status and presence of each flu gene from the background thresholds determined from the canine cells.
We will do this just for the humanplusflu cells, as these are what we will use for downstream analyses.
So we get those cells into a new variable called cells.
cells <- all_cells[[celltype_interest]]
Above, we've determined that the canine cells provide a good indication background fraction of contaminating reads from flu and for each flu gene. Now, we need to get the exact numbers for this background expection. Cells will be called as "infected" and also for the presence of individual genes if they have significantly more than this fraction of mRNA from flu or the indicated flu gene.
thresholds <- threshold_data %>%
mutate(flu=flu_mRNAs, total=total_mRNAs) %>%
summarize_at(c(flugenes, "flu", "total"), sum) %>%
gather(gene, frac, -total) %>%
mutate(frac=frac / total,
percent=frac * 100) %>%
select(-total)
thresholds
Above we noted number of flu reads per cell in the canine cells appear vaguely consistent with Poisson sampling relative to the expected number from a constant contaminating fraction. So for each cell, we can calculate a P-value to reject the null hypothesis that the number of observed flu reads is consistent with them just being from the contaminating background. This function does that test:
poisson_P <- function(n, mu){
# Probability of observing >= `n` given Poisson with mean `mu`
poisson.test(n, r=mu, alternative="greater")$p.value
}
We compute the P-value with which we can reject the null hypothesis that the amount of total flu mRNA is consistent with the background rates, both for total flu and for each gene individually, and also for all flu and by viral barcode.
for (g in c(flugenes, "flu")) {
# get thresholds for this gene
threshold <- thresholds %>% filter(gene == g)
# annotate cells with P values for all flu
pData(cells)[[paste0("has_", g, "_Pval")]] <- map2(
pData(cells)[[paste0(g, "_mRNAs")]],
pData(cells)$total_mRNAs * threshold$frac,
poisson_P) %>%
unlist
}
Now we categorize cells as whether they are infected. We have a lot of cells, so we compute a Q-value for whether the cell is infected. This will allow us to call infection status at some specified false discovery rate (FDR).
pData(cells)$infected_Q <- pData(cells) %>%
mutate(infected_Q=qvalue(has_flu_Pval)$qvalue) %$%
infected_Q
Now let's plot the distribution of Q values for cells being infected (actually, we plot the $\log$ Q-value). As is immediately apparent, the majority of cells have Q values close to 1 ($\log$s close to 0), meaning we can reject that they are infected. But a moderate fraction of cells have very low Q-values, indicating that they are clearly infected.
p_qvals <- ggplot(
pData(cells) %>% mutate(logQ=pmax(-100, log10(infected_Q))),
aes(logQ)) +
stat_ecdf() +
coord_cartesian(xlim=c(-12, 0)) +
ylab("cumulative fraction") +
xlab("log10(Q)")
saveShowPlot(p_qvals, 4, 2.5)
The Q-value distributions above indicate that it doesn't matter much where we set the FDR for calling infected cells, as long as we choose some reasonably small value (e.g., $< 0.1$). Since for our analyses it will be worse to falsely call infected cells than to miss a few, and since the distribution of flu reads per cell was not quite Poisson (and overdispersion will make the P-values liberal), we will be conservative and call infected cells at a FDR of $10^{-8}$, which corresponds to -8 on the log Q-value plot above. Note that this is a ridiculously low FDR; however, since the number of flu reads per cell used to estimate the background isn't exactly Poisson, we wanted to be cautious and only get very clearly infected cells. We then print some basic statistics on the infected and uninfected cells:
fdr_cutoff <- 1e-8
pData(cells)$infected <- pData(cells)$infected_Q < fdr_cutoff
pData(cells) %>%
group_by(infected) %>%
summarize(n=n(),
mean_cellular_mRNAs=mean(cell_mRNAs),
mean_total_mRNAs=mean(total_mRNAs),
min_frac_mRNA_from_flu=min(frac_mRNA_from_flu),
max_frac_mRNA_from_flu=max(frac_mRNA_from_flu))
Now we get the cutoff minimum fraction of flu at which we call cells infected. Note that the maximimum fraction mRNA from flu in the uninfected cells and the minimum fraction in infected cells are actually very slightly different, since the statistical calling method above also takes into account the total number of reads in a cell in calling the P-value. But they are very close, and we will make some plots above using the minimum fraction of flu above which all cells are called infected:
min_frac_flu <- pData(cells) %>%
filter(! infected) %$%
frac_mRNA_from_flu %>%
max
cat("All cells with >", min_frac_flu, "mRNA from flu are called infected")
# get data with infection status into `all_cells`
all_cells[[celltype_interest]] <- cells
# data for histograms
hist_data <- lapply(
names(all_cells),
function (celltype) {
all_cells[[celltype]] %>%
pData %>%
mutate(celltype=celltype,
celltype_pretty=celltypes_pretty[celltype],
infected=ifelse(celltype == celltype_interest,
infected,
frac_mRNA_from_flu > min_frac_flu)) %>%
select(celltype, celltype_pretty, frac_mRNA_from_flu, infected)
}
) %>% bind_rows
# histograms for all cells
p_frac_flu_histograms <- ggplot(
hist_data,
aes(frac_mRNA_from_flu, fill=infected)) +
geom_histogram(bins=33) +
scale_x_continuous() +
scale_fill_manual(values=c(cbPalette[4], cbPalette[3]),
name="classified\nas infected?") +
ylab("number of cells") +
xlab("frac mRNA from flu") +
facet_wrap(~celltype_pretty, ncol=1, scales='free_y') +
theme(strip.text=element_text(size=14), legend.position='none')
# histogram for infected cells
p_frac_flu_infected_histogram <- ggplot(
hist_data %>% filter(celltype == celltype_interest & infected),
aes(frac_mRNA_from_flu)) +
geom_histogram(bins=32, fill=cbPalette[3]) +
scale_x_continuous(name=NULL, limits=c(0, NA)) +
ylab(NULL) +
theme(plot.title=element_text(face='plain', vjust=-2))
# counts of cells infected / uninfected
cellcounts <- hist_data %>%
transform(infected=factor(infected)) %>%
group_by(celltype, infected) %>%
summarize(n=n()) %>%
complete(infected) %>%
mutate(n=ifelse(is.na(n), 0, n)) %>%
mutate(infected=infected %>% as.logical)
# merge into one figure
p_frac_flu <- ggdraw() +
draw_plot(p_frac_flu_histograms, 0, 0, 1, 1) +
draw_plot(p_frac_flu_infected_histogram, 0.4, 0.17, 0.6, 0.28) +
draw_label(
paste(cellcounts %>% filter(celltype == celltype_other & ! infected) %$% n,
"uninfected"),
x=0.96, y=0.91, colour=cbPalette[4], size=12, hjust=1, fontface='italic') +
draw_label(
paste(cellcounts %>% filter(celltype == celltype_other & infected) %$% n,
"infected"),
x=0.96, y=0.87, colour=cbPalette[3], size=12, hjust=1, fontface='italic') +
draw_label(
paste(cellcounts %>% filter(celltype == celltype_interest & ! infected) %$% n,
"uninfected"),
x=0.96, y=0.48, colour=cbPalette[4], size=12, hjust=1, fontface='italic') +
draw_label(
paste(cellcounts %>% filter(celltype == celltype_interest & infected) %$% n,
"infected"),
x=0.96, y=0.44, colour=cbPalette[3], size=12, hjust=1, fontface='italic')
saveShowPlot(p_frac_flu, width=3, height=4.5)
Now we want to call whether each infected cell is expressing each individual flu gene above background. We've already used a method to control for multiple hypotheses (controlling the FDR) above, so now we make the call for each gene cell individually using just the P-values without a further multiple hypothesis test. The reason that we do not need another multiple hypothesis test is that we are conditioning on the cells already having been called as infected.
First, we plot the distribution of P-values among infected cells:
pvals <- pData(cells) %>%
filter(infected) %>%
select(CellBarcode, matches('_Pval')) %>%
gather(gene, P, -CellBarcode) %>%
mutate(gene=gsub("has_|_Pval", "", gene)) %>%
filter(gene != "infected", gene != "flu")
p_pvals <- ggplot(
pvals %>% mutate(logP=pmax(-100, log10(P))),
aes(logP)) +
stat_ecdf() +
coord_cartesian(xlim=c(-3, 0)) +
facet_wrap(~ gene, nrow=1) +
ylab("cumulative fraction") +
xlab("log10(P)")
saveShowPlot(p_pvals, 10, 2.5)
The plot above clearly shows that it will be harder to set a clean P-value cutoff for whether cells have a specific flu gene. For some of the more highly expressed genes (e.g., NS, M) it is pretty clear. But for lower-expressed genes, it's harder to make a clean division. So we'll choose a P-value cutoff of 0.05, which seems plausible given above and also means that we're pretty sure that most of the time we're not falsely calling cells as having a flu gene that they really are missing.
p_has_gene_cutoff <- 0.05
pData(cells)[map_chr(flugenes, function (g) paste0("has_", g))] <- (
(pData(cells)[map_chr(flugenes, function(g) paste0("has_", g, "_Pval"))]
< p_has_gene_cutoff) & pData(cells)$infected)
Compute the number of distinct flu genes per cell based on the gene presence / absence calling above:
pData(cells)['n_flu_genes'] <- pData(cells) %>%
select_(.dots=lapply(flugenes, function (g) paste0("has_", g))) %>%
rowSums
Now plot the distribution of the number of flu genes per cell among the cells we have previously called as infected.
p_flugene_count <- ggplot(
pData(cells) %>% filter(infected), aes(n_flu_genes)) +
geom_bar(fill=cbPalette[3]) +
ylab("number of cells") +
geom_text(stat='count', aes(label=..count..), vjust=-0.3) +
expand_limits(y=c(0, 190)) +
theme(axis.text.x=element_blank(),
axis.title.x=element_blank(),
axis.ticks.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank())
p_fracflu_by_ngenes <- ggplot(
pData(cells) %>% filter(infected),
aes(factor(n_flu_genes), frac_mRNA_from_flu)) +
geom_boxplot(outlier.size=1, outlier.alpha=0.4,
fill=cbPalette[3], outlier.color=cbPalette[3]) +
xlab("number of flu genes") +
ylab("frac mRNA from flu")
p_nflu_genes <- plot_grid(p_flugene_count, p_fracflu_by_ngenes,
ncol=1, align='v', rel_heights=c(1, 1.3))
saveShowPlot(p_nflu_genes, width=3.5, height=4)
Here we plot and show statistics on the fraction of infected cells that have each gene. Typically genes are present in infected cells >85% of the time but <95% of the time, which is comparable to prior estimates.
The exception is that we call almost all cells as having NP, which may just be because we have a hard time detecting infection if cells lack NP due to failure of secondary transcription.
tab_frac_has_gene <- pData(cells) %>%
filter(infected) %>%
rename_all(funs(str_replace_all(., 'has_flu', ''))) %>%
select(flugenes_noprefix) %>%
summarize_all(mean)
p_frac_has_gene <- ggplot(
tab_frac_has_gene %>%
gather(gene, frac) %>%
transform(gene=factor(gene, flugenes_noprefix)),
aes(x=gene, y=frac)) +
geom_bar(stat='identity', fill=cbPalette[3]) +
geom_text(aes(label=sprintf("%0.2f", frac)), vjust=1.5, color='white') +
theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5),
axis.text.y=element_blank(),
axis.ticks.y=element_blank()) +
ylab("fraction cells with gene") +
geom_hline(yintercept=1, linetype='dashed', color=cbPalette[1])
saveShowPlot(p_frac_has_gene, width=3.5, height=2.75, isfig=TRUE)
We plot the relative expression of flu genes among the infected humanplusflu cells. Note that this distribution of expressions is reasonably similar to that in Russell et al (2018) in terms of the heirarchy of which genes are most highly expressed. Reasons that the values are not exactly identical could include: we have applied slightly different filters to call infected cells, are looking at a different timepoint, and have enriched for IFN+ cells.
p_flu_rel_expr <- ggplot(
pData(cells) %>%
filter(infected) %>%
mutate_at(sapply(flugenes, function (x) paste0(x, "_mRNAs")),
return) %>%
select(CellBarcode, flu_mRNAs, flugenes) %>%
gather(gene, counts, flugenes) %>%
transform(gene=factor(gsub("flu", "", gene), flugenes_noprefix)) %>%
mutate(frac=counts / flu_mRNAs),
aes(gene, frac)) +
geom_boxplot(notch=TRUE, outlier.size=1, outlier.alpha=0.4,
fill=cbPalette[3], outlier.color=cbPalette[3]) +
ylab("frac viral mRNA") +
theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5),
axis.title.x=element_blank())
saveShowPlot(p_flu_rel_expr, width=3.5, height=3)
Cells were infected with a mix of wildtype and synonymously barcoded viruses. Here we call whether they are infected with viruses with just one barcode, or are co-infected. Note that we expect to only identify about half (actually a few less, since the barcode mixing is not exactly even, see above) of the truly co-infected cells this way, since some will be co-infected with the same barcode.
First, we annotated the number of called flu barcodes and the "purity" of these barcoded flu reads. The purity if the fraction of annotated barcodes that come from the more abundant variant (wildtype or synonymously barcoded). The purity can therefore range from 0.5 (equal mix of both virus barcode variants) to 1 (only one barcode).
We then calculate the P-value with which we can reject the hypothesis that the barcodes are drawn from a distribution with a mean purity of at least 95%. We then call as co-infected all cells that are infected and for which this P-value is $<10^{-3}$. We also annotate each cell by whether it is infected with each barcode.
Finally, we print a table of the number of cells that we can call as coinfected this way, and also plot the cells that we called as coinfected versus their number of annotated barcodes and purity:
pData(cells)$n_annotated_flu <- pData(cells)$annotated_flu_wt +
pData(cells)$annotated_flu_syn
pData(cells)$n_major_barcode <- pmax(pData(cells)$annotated_flu_wt,
pData(cells)$annotated_flu_syn)
pData(cells)$purity <- pData(cells)$n_major_barcode /
pData(cells)$n_annotated_flu
pData(cells)$coinfected_P <- map2(
pData(cells)$n_major_barcode,
pData(cells)$n_annotated_flu,
function(x, n) ifelse(n == 0, 1, binom.test(x, n, 0.95, "less")$p.value)
)
# annotate whether coinfected
pData(cells)$coinfected <- pData(cells)$infected &
(pData(cells)$coinfected_P < 1e-3)
# annotate whether infected by each barcode
pData(cells)$infected_wt <- pData(cells)$infected &
(pData(cells)$coinfected |
(pData(cells)$annotated_flu_wt >= pData(cells)$annotated_flu_syn))
pData(cells)$infected_syn <- pData(cells)$infected &
(pData(cells)$coinfected |
(pData(cells)$annotated_flu_syn >= pData(cells)$annotated_flu_wt))
pData(cells) %>%
filter(infected) %>%
group_by(coinfected) %>%
summarize(ncells=n(),
mean_frac_flu=mean(frac_mRNA_from_flu))
p_coinfected <- ggplot(pData(cells) %>% filter(infected),
aes(x=n_annotated_flu, y=purity, color=coinfected)) +
geom_point() +
scale_color_manual(values=cbbPalette[2:3])
saveShowPlot(p_coinfected, 4.5, 3)
We obviously only observe the co-infected cells with a mix of both barcodes.
We can use the multiplet_freq function defined above (and taken from Bloom (2018, DOI 10.1101/293639)) to estimate the actual co-infection frequency assuming Poisson infection:
tab_coinfect <- pData(cells) %>%
filter(infected) %>%
mutate(viral_barcode=ifelse(coinfected, "both",
ifelse(infected_wt, "wt", "syn"))) %>%
group_by(viral_barcode) %>%
summarize(ncells=n()) %>%
spread(key=viral_barcode, value=ncells) %>%
mutate(coinfection_freq=multiplet_freq(wt + both, syn + both, both))
Plot the numbers and coinfection frequency:
p_coinfect <- ggplot(
tab_coinfect %>%
select(-coinfection_freq) %>%
gather(barcode, ncells) %>%
transform(barcode=factor(barcode, c("wt", "syn", "both"))) %>%
mutate(coinfected=ifelse(barcode == "both", TRUE, FALSE)),
aes(x=barcode, y=ncells, fill=coinfected)) +
geom_bar(width=0.75, position="dodge", stat="identity") +
geom_text(aes(label=ncells), hjust=0.5, vjust=1.4, color='white') +
scale_fill_manual(values=c(cbPalette[3], cbPalette[6])) +
scale_y_continuous(name="number of cells",
expand=expand_scale(c(0.02, 0.2))) +
scale_x_discrete(name="viral barcode") +
theme(axis.text.y=element_blank(),
axis.ticks.y=element_blank(),
text=element_text(size=14),
axis.text=element_text(size=12),
legend.position='none') +
annotate("text", x=3.5, y=180,
label=sprintf(
paste("dual-barcode coinfection: %.0f%%",
"estimated coinfection rate: %.0f%%",
sep='\n'),
100 * tab_coinfect$both / (tab_coinfect$wt +
tab_coinfect$syn + tab_coinfect$both),
100 * tab_coinfect$coinfection_freq),
hjust=1,
vjust=0.4,
fontface="italic")
saveShowPlot(p_coinfect, width=2.9, height=2.2)
An interesting feature of this analysis is that even though the results above indicate that less than a quarter of cells are called infected, more than half are coinfected. This indicates that there is some enrichment for coinfected cells beyond what would be expected under a Poisson process. This could be because infection is not Poisson, because coinfection increases our ability to detect coinfection (such as by increasing the mRNA from an infecting virus with a defective polymerase), or because the interferon sorting enriched for coinfected cells among the infected cells.
We now want to identify the cells that are interferon (IFN) positive.
A549 cells induce both type I and type III IFN in response to viral infection. In addition, autocrine and paracrine signaling induce expression of various IFN-stimulated genes (ISGs).
We get all type I and type III interferons by their names, and select a few ISGs that we know have high expression. Below we print the names of the genes these IFNs and ISGs:
isg_genes <- c("IFIT1", "ISG15", "CCL5", "MX1")
ifn_genes <- fData(cells) %>%
select(gene_short_name) %>%
filter(str_detect(gene_short_name, "IFN[ABL]") | gene_short_name %in% isg_genes,
!str_detect(gene_short_name, "IFN.R")) %>%
mutate(genetype=ifelse(gene_short_name %in% isg_genes, "ISG",
ifelse(str_detect(gene_short_name, "IFNL"),
"type_III_IFN", "type_I_IFN"))) %>%
arrange(genetype, gene_short_name)
ifn_genes %>%
group_by(genetype) %>%
mutate(genes=paste0(gene_short_name, collapse=", ")) %>%
select(genetype, genes) %>%
filter(row_number() == 1)
Now for each cell, we calculate:
For the fraction calculations, we add a pseudocount to the number of IFN or ISG mRNAs. This fraction is therefore an upper bound on the amount of each type of mRNA, and also allows us to avoid taking the logarithm of zero.
pseudocount <- 1
for (gtype in ifn_genes$genetype %>% unique) {
genes <- ifn_genes %>% filter(genetype == gtype) %$% gene_short_name
pData(cells)[[gtype]] <- Matrix::colSums(exprs(
cells[row.names(subset(fData(cells), gene_short_name %in% genes)), ]))
pData(cells)[[paste0(gtype, "_frac")]] <- (pData(cells)[[gtype]] +
pseudocount) / pData(cells)$cell_mRNAs
pData(cells)[[paste0("log_", gtype, "_frac")]] <- log10(pData(cells)[[paste0(gtype, "_frac")]])
}
We plot the correlation between relative expression measure (fraction of cellular mRNA) for type I and type III IFN:
p_ifn_genes_corr <- ggplot(
pData(cells) %>%
mutate(infected=ifelse(infected, "infected", "not infected")),
aes(x=type_III_IFN_frac, y=type_I_IFN_frac, color=infected)) +
geom_point(alpha=0.5, size=2) +
stat_cor(color='black', fontface='italic') +
scale_color_manual(values=rev(rev(cbPalette[3:4]))) +
facet_wrap(~infected, nrow=1) +
theme(legend.position="none", strip.text=element_text(size=14)) +
xlab("frac mRNA from IFN III") +
ylab("frac mRNA from IFN I")
saveShowPlot(p_ifn_genes_corr, width=4.7, height=2.5, isfig=TRUE)
The plot above shows that the expression of type I and type III IFNs is highly correlated. Also, most cells expressing either of these IFNs are clearly flu infected (the others must either not be infected or just have such low amounts of flu that we failed to call them as infected).
Because type I and type III IFN are so correlated, we will define a new variable which is just the sum of type I and type III IFN expressions, and use this to call IFN status. Note that this variable mostly just reflects type III IFN levels since these are $\approx$10-fold higher than type I levels. We also compute the fraction and log10 fraction of mRNA from IFN, again adding a pseudocount to avoid logs of zero:
pData(cells)$IFN <- pData(cells)$type_I_IFN +
pData(cells)$type_III_IFN
pData(cells)$IFN_frac <- (pData(cells)$IFN +
pseudocount) / pData(cells)$cell_mRNAs
pData(cells)$log_IFN_frac <- log10(pData(cells)$IFN_frac)
Based on the plots above, we will call IFN status based just on the total fraction of mRNA from IFN (type I and type III combined). We don't use the ISGs because we are interested in cells that directly activate IFN due to viral infection, not cells that express ISGs possibly due to paracrine signaling.
We will classify cells into three IFN status groups:
pData(cells)$log_IFNneg_frac_threshold <- pData(cells) %>%
mutate(threshold=log10(2 / min(cell_mRNAs))) %$%
threshold
pData(cells)$log_IFNpos_frac_threshold <- pData(cells) %>%
mutate(threshold=log10(5 / min(cell_mRNAs))) %$%
threshold
cat(sprintf(paste(
"Calling IFN- if <=%.2e%% of mRNA from IFN",
"and IFN+ if >=%.2e%% from IFN."),
10**(pData(cells) %>% head(n=1) %$% log_IFNneg_frac_threshold),
10**(pData(cells) %>% head(n=1) %$% log_IFNpos_frac_threshold)))
pData(cells)$IFN_state <- pData(cells) %>%
mutate(IFN_state=ifelse(log_IFN_frac >= log_IFNpos_frac_threshold, "IFN+",
ifelse(log_IFN_frac <= log_IFNneg_frac_threshold, "IFN-",
"IFNambiguous"))) %$%
IFN_state
Below we plot cells in these classifications. It is immediately obvious that there are many more IFN+ cells among the infected ones, which is as we expect if infection is what is inducing IFN.
# histograms of IFN expression
p_ifn_dist <- ggplot(
pData(cells) %>%
mutate(
infected=ifelse(infected, "infected", "not infected"),
fill=ifelse(IFN_state %in% c("IFN+", "IFNambiguous"), IFN_state,
ifelse(infected == "infected", "IFN-infected", "IFN-uninfected"))
),
aes(log_IFN_frac, fill=fill)) +
geom_histogram(bins=37) +
ylab("number of cells") +
scale_fill_manual(values=c(cbPalette[3], cbPalette[4],
cbPalette[2], cbPalette[1])) +
geom_vline(aes(xintercept=log_IFNneg_frac_threshold),
color=cbbPalette[[8]], linetype='dashed') +
geom_vline(aes(xintercept=log_IFNpos_frac_threshold),
color=cbbPalette[[8]], linetype='dashed') +
facet_wrap(~ infected, scales="free_y", ncol=1) +
theme(legend.position="none", strip.text=element_text(size=14)) +
scale_x_continuous(name="frac mRNA from IFN",
labels=function (x) fancy_scientific(10**x))
# numbers to annotate plot
n_infected <- pData(cells) %>% filter(infected) %>% nrow
n_uninfected <- pData(cells) %>% filter(! infected) %>% nrow
n_infected_IFN <- pData(cells) %>%
filter(infected & IFN_state == "IFN+") %>% nrow
n_uninfected_IFN <- pData(cells) %>%
filter(! infected & IFN_state == "IFN+") %>% nrow
# annotate plot
p_ifn_dist <- ggdraw() +
draw_plot(p_ifn_dist, 0, 0, 1, 1) +
draw_label(
paste0(n_infected_IFN, " of ", n_infected, "\nIFN+"),
x=0.95, y=0.85, size=12, hjust=1, fontface='italic') +
draw_label(
paste0(n_uninfected_IFN, " of ", n_uninfected, "\nIFN+"),
x=0.95, y=0.43, size=12, hjust=1, fontface='italic')
saveShowPlot(p_ifn_dist, 3, 3.5)
# print statistics
pData(cells) %>%
group_by(infected, IFN_state) %>%
summarize(cells=n())
The plots above show that (as we expect), there is a much higher fraction of IFN+ cells among the clearly virally infected ones.
What about the rare handful of IFN+ cells that we did not call as infected? Maybe our experiment sorted out cells that are spontaneously activating IFN or activating it in response to some damage-associated molecular pattern (DAMP) from the infected cells--or maybe some of the cells we call as un-infected are fact infected at very low levels that induces IFN, but just didn't pass our filters for being clearly infected.
Now we make a plot similar to the one above but for expression of ISGs rather than IFN. Note that by "ISG" we mean the total mRNA counts for the four prototype ISGs defined above.
Since our downstream analyses don't actually do anything with ISG expression, we spend less time worrying about how this is called. We just make a heuristic cutoff of saying a cell is ISG+ if it has more than 0.1% of its mRNA from the four ISG genes.
We then plot the distribution of ISG expression for both infected and uninfected cells. We see that for both infected and uninfected there are more ISG+ than IFN+ cells, but the increase is much larger (10-fold) for uninfected cells than it is for infected ones (2-fold). In both cases, presumably paracrine signaling is leading to ISG expression in the absence of IFN.
# heuristic cutoff for calling ISG+
ISG_cutoff <- 1e-3
# classify cells based on heuristic threshold
pData(cells)$ISG_state <- ifelse(pData(cells)$ISG_frac > ISG_cutoff, "ISG+", "ISG-")
# make histogram
p_isg_dist <- ggplot(
pData(cells) %>%
mutate(
infected=ifelse(infected, "infected", "not infected"),
fill=ifelse(ISG_state == "ISG+", "ISG+",
ifelse(infected == "infected", "ISG-infected", "ISG-uninfected"))
),
aes(log_ISG_frac, fill=fill)) +
geom_histogram(bins=25) +
geom_vline(aes(xintercept=log10(ISG_cutoff)),
color=cbbPalette[[8]], linetype='dashed') +
ylab("number of cells") +
scale_fill_manual(values=c(cbPalette[3], cbPalette[4], cbPalette[7])) +
facet_wrap(~ infected, scales="free_y", ncol=1) +
theme(legend.position="none", strip.text=element_text(size=14)) +
scale_x_continuous(name="frac mRNA from ISG",
labels=function (x) fancy_scientific(10**x))
# numbers to annotate plot
n_infected_ISG <- pData(cells) %>%
filter(infected & ISG_frac > ISG_cutoff) %>% nrow
n_uninfected_ISG <- pData(cells) %>%
filter(! infected & ISG_frac > ISG_cutoff) %>% nrow
# annotate plot
p_isg_dist <- ggdraw() +
draw_plot(p_isg_dist, 0, 0, 1, 1) +
draw_label(
paste0(n_infected_ISG, " of ", n_infected, "\nISG+"),
x=0.95, y=0.85, size=12, hjust=1, fontface='italic') +
draw_label(
paste0(n_uninfected_ISG, " of ", n_uninfected, "\nISG+"),
x=0.95, y=0.43, size=12, hjust=1, fontface='italic')
saveShowPlot(p_isg_dist, 3, 3.5)
We next plot the correlation between ISG and IFN expression:
p_isg_corr <- ggplot(
pData(cells) %>%
mutate(infected=ifelse(infected, "infected", "not infected"),
fill=ifelse(ISG_state == "ISG+", "ISG+",
ifelse(infected == "infected", "ISG-infected",
"ISG-uninfected"))),
aes(x=log_ISG_frac, y=log_IFN_frac, color=fill)) +
geom_point(alpha=0.5, size=2) +
stat_cor(color='black', fontface='italic', label.sep='\n') +
scale_color_manual(values=c(cbPalette[3], cbPalette[4], cbPalette[7])) +
facet_wrap(~infected, ncol=1) +
theme(legend.position="none", strip.text=element_text(size=14)) +
scale_x_continuous(name="frac mRNA from ISG",
labels=function (x) fancy_scientific(10**x)) +
scale_y_continuous(name="frac mRNA from IFN",
labels=function (y) fancy_scientific(10**y))
saveShowPlot(p_isg_corr, width=2.7, height=4.5)
This plot confirms that ISG expression is reasonably correlated with IFN expression in infected cells (although there are some high ISG cells without IFN), but that it is not well correlated in uninfected cells. Therefore, most of the uninfected cells expression ISGs are probably doing so due to paracrine signaling.
We merge the two plots above into a single figure for the paper:
p_isg <- plot_grid(p_isg_dist, p_isg_corr,
nrow=1, align='h', scale=0.9, rel_widths=c(1.25, 1),
labels=c("A", "B"), label_size=18, hjust=-1)
saveShowPlot(p_isg, width=6.3, height=4.5, isfig=TRUE)
Now we are going to use monocle's capabilities for clustering the cells. The goal is simply to visualize what genes explain the structure of the data.
First, we estimate the size factors and dispersions. Note that for host differential gene expression of factors associated with influenza burden, there is some question of whether this should be done on cellular mRNAs only or on all mRNAs, since the viral mRNAs displace cellular mRNAs and so perhaps should not be considered for size factors. But for clustering, we want to use all the genes including the viral ones, so we estimate size factors on everything. We also annotate the number of cells in which each gene is detectably expressed. All of these operations are directly out of the monocle vignette.
cells <- estimateSizeFactors(cells)
cells <- estimateDispersions(cells)
cells <- detectGenes(cells, min_expr=0.1)
We also make sure that the expression values are lognormally distributed as recommended by the monocle vignette. The plot below shows that yes, they are close.
p_expr_lognorm <- ggplot(
exprs(cells) %>%
log %>%
(Matrix::t) %>%
scale %>%
(Matrix::t) %>%
melt,
aes(value)) +
geom_density(color=cbbPalette[[2]]) +
stat_function(fun=dnorm, size=0.5, color=cbbPalette[[3]]) +
xlab("standardized log(mRNAs)")
saveShowPlot(p_expr_lognorm, 3, 2)
We now select reasonalby highly expressed genes to use for clustering, plot their expression and dispersion, and plot the variance explained by each component. This is all paralleling the monocle vignette.
unsup_clustering_genes <- dispersionTable(cells) %>% subset(mean_expression > 0.5)
cells <- setOrderingFilter(cells, unsup_clustering_genes$gene_id)
p_ordering_genes <- plot_ordering_genes(cells)
saveShowPlot(p_ordering_genes, 3.5, 3)
p_variance_explained <- plot_pc_variance_explained(cells, return_all=FALSE)
saveShowPlot(p_variance_explained, 3.5, 3)
Now we do the tSNE clustering.
cells <- reduceDimension(cells, max_components=2, num_dim=6, reduction_method='tSNE')
cells <- clusterCells(cells)
And now we plot the cells color by the amount of flu mRNA, IFN expression, and expression of our selected ISGs. We see that flu expression and ISG expression explain a large amount of the structure in the data, as there is a distinct group of flu positive cells, and then a distinct group of ISG-expressing cells which partially overlaps with flu positive cells.
pData(cells)$log_frac_mRNA_from_flu <- log10(pData(cells)$frac_mRNA_from_flu)
tsne_plotlist <- map(
c("log_frac_mRNA_from_flu", "log_IFN_frac", "log_ISG_frac",
"frac_mRNA_from_flu", "IFN_frac", "ISG_frac"),
function (attr) {
plot_cell_clusters(cells, color=attr) +
scale_color_gradient(
low=cbPalette[1],
high=ifelse(str_detect(attr, "flu"), cbPalette[3],
ifelse(str_detect(attr, "IFN"), cbPalette[2], cbPalette[7])),
name=gsub("frac mRNA from flu", "flu frac",
gsub("_", " ",
gsub("log", "log10", attr)))) +
guides(color=guide_colorbar(title.position="top", title.hjust=0.5,
barwidth=10, barheight=0.8), alpha=NULL)
}
)
ncol <- 3
p_tsne <- plot_grid(plotlist=tsne_plotlist, ncol=ncol, scale=0.95)
saveShowPlot(p_tsne,
width=3.5 * min(length(tsne_plotlist), ncol),
height=4.2 * ceiling(length(tsne_plotlist) / ncol),
isfig=TRUE)
The PacBio sequencing allows us to analyze viral genotypes, using the cell annotations added by the pacbio_analysis.ipynb notebook. This is different than the 10X data above, which simply counts the 3' ends of transcripts.
We obtain viral genotypes only for infected cells, so get these:
infected_cells <- cells[, pData(cells)$infected]
One piece of information that the PacBio sequencing provides that the 10X does not is the identity of different isoforms of the two alternatively spliced genes (M1 / M2 and NS1 / NS2).
Note that the PacBio sequencing counts are expected to be much less good at quantitating levels of different transcripts due to the many rounds of PCR that are performed to get enough material for library prep.
Nonetheless, we can check if the levels M1 / M2 and NS1 / NS2 (or any other PacBio counted gene) is associated with IFN state, again only looking at cells that we can unambiguously call as IFN+ or IFN-. We do that below. We repeat the finding from 10X that there is a (non-significant) association between low NS1 and IFN+, and of high PB2 and IFN+. But there is nothing obvious with NS2 and M2, so it does not appear that the abundance of these different isoforms is a major factor:
# the isoforms of all flu genes called by PacBio
isoforms <- c("fluPB2", "fluPB1", "fluPA", "fluHA", "fluNP",
"fluNA", "fluM1", "fluM2", "fluNS1", "fluNS2")
# data frame giving fraction of PacBio from each isoform
isoform_counts <- pData(infected_cells) %>%
filter(IFN_state %in% c("IFN+", "IFN-")) %>%
gather(gene_barcodes, n_PacBio,
lapply(isoforms, function (i) c(paste0(i, "_wt_n_PacBio_seqs"),
paste0(i, "_syn_n_PacBio_seqs")))
%>% unlist) %>%
separate(gene_barcodes, "isoform", sep="_", extra="drop") %>%
transform(isoform=factor(isoform, isoforms)) %>%
group_by(CellBarcode, isoform, IFN_state) %>%
summarize(n_PacBio_isoform=sum(n_PacBio)) %>%
group_by(CellBarcode) %>%
mutate(n_PacBio_cell=sum(n_PacBio_isoform),
frac_PacBio=n_PacBio_isoform / n_PacBio_cell) %>%
ungroup
p_isoformfrac_IFN <- ggplot(isoform_counts, aes(IFN_state, frac_PacBio)) +
geom_boxplot(outlier.color=NA) +
geom_jitter(width=0.25, alpha=0.2, size=1) +
facet_wrap(~isoform, nrow=1) +
stat_compare_means(method="wilcox.test",
aes(label = paste0("P = ", ..p.format..)), vjust=0.5)
saveShowPlot(p_isoformfrac_IFN, 11, 3)
We also take a detailed look at the relationship just between the two alternate isoform pairs (M1 / M2 and NS1 / NS2) in the IFN+ and IFN- cells. That plot is below. Again, it indicates that altered ratios of these two isoforms does not appear to explain IFN induction in our experiments, and the relationship looks similarly linear between the two isoforms for both IFN+ and IFN- cells:
isoform_scatter_plots <- lapply(
list(c("fluM1", "fluM2"), c("fluNS1", "fluNS2")),
function (gene_isoforms)
ggplot(
isoform_counts %>%
select(-frac_PacBio) %>%
filter(isoform %in% gene_isoforms) %>%
spread(key=isoform, value=n_PacBio_isoform),
aes_string(gene_isoforms[[1]], gene_isoforms[[2]], color='IFN_state')) +
geom_point(alpha=0.4) +
geom_smooth(method='lm', se=FALSE) +
scale_color_manual(values=rev(cbbPalette[2:3]))
)
p_isoform_scatter <- plot_grid(plotlist=isoform_scatter_plots, nrow=1)
saveShowPlot(p_isoform_scatter, 7, 2.5)
Given these results, we don't use the PacBio to try to quantify any type of gene levels any further, as beyond M1 / M2 and NS1 / NS2, the 10X data should be vastly better at quantifying gene expression levels.
We now want to determine for which cells we can call the full viral genotype. By calling the genotype, we mean using the PacBio sequencing to identify the mutations present in all viral genes that the 10X sequencing says are present in that cell.
First, we need to determine from the 10X sequencing for which viral genes each cell should have. We have already narrowed ourselves down to infected cells, and then called each these cells by whether they are co-infected both viral barcodes, and whether they have each flu gene. However, we haven't yet called whether each cell has each flu gene from each viral barcode.
We do this below. We say that a cell should have a particular gene and viral barcode if:
# the two viral barcodes
viral_barcodes <- c("wt", "syn")
# has gene barcode if coinfected, has gene, and gene barcode frac > this
has_barcode_min_frac <- 0.25
# get total number mRNAs
n_mRNA <- pData(infected_cells)$total_mRNAs
# annotate cells by whether they have barcoded gene
for (gene in flugenes) {
# get background frac flu for this gene
gene_background_frac <- thresholds$frac[thresholds$gene == gene]
# get number of barcode annotated reads for gene
n_gene <- pData(infected_cells)[[paste('annotated', gene, 'wt', sep='_')]] +
pData(infected_cells)[[paste('annotated', gene, 'syn', sep='_')]]
# fraction annotated
frac_annotated <- pData(infected_cells)$n_annotated_flu / pData(infected_cells)$flu_mRNAs
for (bc in viral_barcodes) {
# get number reads for barcoded gene
n_gene_bc <- pData(infected_cells)[[paste('annotated', gene, bc, sep='_')]]
# call whether barcoded gene present
pData(infected_cells)[paste('has', gene, bc, sep='_')] <- (
pData(infected_cells)[paste('infected', bc, sep='_')] &
pData(infected_cells)[paste('has', gene, sep='_')] &
(((n_gene_bc / n_gene) > has_barcode_min_frac) |
(map2(n_gene_bc, gene_background_frac * n_mRNA * frac_annotated, poisson_P)
< p_has_gene_cutoff))
) %>% as.vector
}
}
Now we make a tidy data frame (called genotypes) with the gene presence and mutation data. We add the following columns for each cell:
We do this only for cells for which we can unambiguously call the IFN state, as we care about analyzing how mutations associate with IFN state.
# types of mutations
mutation_types <- c("substitutions", "deletions", "insertions")
# vector with all combinations of flu genes and viral barcodes
genes_barcodes <- lapply(flugenes, function (g) lapply(viral_barcodes,
function (bc) paste(g, bc, sep='_'))) %>% unlist
# vector with all combinations of flu genes, viral barcodes, mutations
genes_barcodes_mutations <- lapply(genes_barcodes, function(g_bc) lapply(
mutation_types, function(m) paste(g_bc, m, sep='_'))) %>% unlist
# tidy data frame with genotype information
genotypes <- pData(infected_cells) %>%
# make tidy for barcoded genes expected by 10X
gather(key="gene_barcode", value="barcoded_gene_present",
lapply(genes_barcodes, function (x) paste0('has_', x)) %>% unlist) %>%
separate(gene_barcode, c("dummy", "flu_gene", "viral_barcode")) %>%
select(-dummy) %>%
# make tidy for PacBio data for barcoded genes
gather(key="gene_barcode_mutation", value="mutations",
genes_barcodes_mutations) %>%
separate(gene_barcode_mutation, c("flu_gene2", "viral_barcode2", "mutation_type")) %>%
filter(flu_gene == flu_gene2 & viral_barcode == viral_barcode2) %>%
select(-flu_gene2, -viral_barcode2) %>%
# make flu_gene a factor
transform(flu_gene=factor(flu_gene, flugenes)) %>%
# add nice label for coinfection state
mutate(coinfection_state=factor(
ifelse(coinfected, "dual-barcode coinfection", "single viral barcode"),
c("single viral barcode", "dual-barcode coinfection")))
# make sure we have expected number of rows
stopifnot(nrow(genotypes) == (nrow(pData(infected_cells)) *
length(flugenes) * length(viral_barcodes) * length(mutation_types)))
# only cells with unambiguous IFN_state
genotypes <- genotypes %>% filter(IFN_state %in% c("IFN+", "IFN-"))
Now we want to identify gene / barcode combinations that based on the 10X data we called as being present, but for which we do not have a PacBio consensus. First, annotate these with the missing_PacBio column in the data frame:
genotypes <- genotypes %>%
mutate(missing_PacBio=barcoded_gene_present & is.na(mutations)) %>%
# add number of missing gene / barcode combinations per cell
group_by(CellBarcode) %>%
mutate(n_missing_gene_barcode=sum(missing_PacBio) / length(mutation_types)) %>%
ungroup
Now plot the distribution of the number of gene / barcode combinations that are missing PacBio data. Obviously, our goal is to get as many cells as possible to have zero expected genes missing:
p_n_missing <- ggplot(
genotypes %>% group_by(CellBarcode) %>% summarize_all(funs(dplyr::first)),
aes(n_missing_gene_barcode)) +
geom_bar() +
scale_x_continuous(breaks=pretty_breaks()) +
ylab("number of cells") +
xlab("number of PacBio genes / barcodes missing") +
facet_grid(IFN_state ~ coinfection_state, scales='free_y')
saveShowPlot(p_n_missing, width=5.5, height=4)
We also look to see which flu genes tend to be missing. Below we count up the number of times each flu genes is missing among the cells that are missing just one or two genes. We see that the genes that are missing most often are HA and NA. Therefore, if we were to do more PacBio sequencing, those would be the genes that it would be most productive to target more.
p_missing_by_gene <- ggplot(
genotypes %>%
filter(n_missing_gene_barcode <= 2) %>%
group_by(flu_gene, viral_barcode, IFN_state, coinfection_state) %>%
summarize(n_missing=sum(missing_PacBio / length(mutation_types))),
aes(flu_gene, n_missing, fill=viral_barcode)) +
geom_bar(position=position_dodge(), stat='identity') +
scale_fill_manual(values=cbPalette[2:3]) +
facet_grid(IFN_state ~ coinfection_state, scales='free_y') +
theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))
saveShowPlot(p_missing_by_gene, width=7, height=3.5)
Now we annotate as complete the cells for which we can use the PacBio to call the complete viral genotype. By complete, we don't mean that all viral genes are present in the cell, but rather that we can use the PacBio to call the genotype of all genes that the 10X sequencing says is present.
We then plot the number of such cells below. As can be seen below, we are worse at calling these genotypes in dual-barcode co-infections.
genotypes <- genotypes %>%
mutate(complete=(n_missing_gene_barcode == 0),
complete_state=ifelse(complete,
"complete", "incomplete"))
n_complete_genotypes <- genotypes %>%
group_by(IFN_state, coinfection_state, complete_state) %>%
summarize(n_cells=length(unique(CellBarcode)))
p_n_complete_genotype <- ggplot(
genotypes %>%
group_by(IFN_state, coinfection_state, complete_state) %>%
summarize(n_cells=length(unique(CellBarcode))),
aes(complete_state, n_cells, fill=IFN_state, label=n_cells)) +
geom_bar(stat='identity', position=position_dodge()) +
geom_text(position=position_dodge(width=1), vjust=-0.3) +
facet_wrap(~ coinfection_state) +
scale_fill_manual(values=rev(cbPalette[2:3]), name=NULL) +
theme(
axis.ticks.y=element_blank(),
axis.text.y=element_blank(),
strip.text=element_text(size=14, margin=margin(b=3))) +
scale_y_continuous(name="number of cells",
limits=c(0, 1.15 * (n_complete_genotypes$n_cells %>% max))) +
xlab("viral genes in cell completely sequenced?")
saveShowPlot(p_n_complete_genotype, 6.1, 2.5)
p_frac_flu_complete_genotype <- ggplot(
genotypes %>%
group_by(CellBarcode, complete_state,
frac_mRNA_from_flu, IFN_state) %>%
summarize_all(funs(dplyr::first)),
aes(complete_state, frac_mRNA_from_flu)) +
geom_boxplot(aes(color=IFN_state), position=position_dodge(),
outlier.size=1, outlier.alpha=0.4) +
geom_boxplot(aes(fill=IFN_state), position=position_dodge(),
outlier.color=NA) +
facet_wrap(~ coinfection_state) +
scale_fill_manual(values=rev(cbPalette[2:3]), name=NULL) +
scale_color_manual(values=rev(cbPalette[2:3]), guide=FALSE) +
theme(strip.text=element_text(size=14, margin=margin(b=3))) +
scale_y_continuous(name="frac mRNA from flu") +
xlab("viral genes in cell completely sequenced?")
saveShowPlot(p_frac_flu_complete_genotype, 6.3, 2.35)
Merge the two plots above into a single plot for a figure:
p_cells_complete <- plot_grid(
p_n_complete_genotype +
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.title.x=element_blank()),
p_frac_flu_complete_genotype +
theme(strip.background=element_blank(),
strip.text.x=element_blank()),
ncol=1, align='v', rel_heights=c(1, 1.25),
labels=c("A", "B"), label_size=18, vjust=1)
saveShowPlot(p_cells_complete, 6.3, 4.75, isfig=TRUE)
Now we examine which types of mutations are present as annotated in pacbio_analysis.ipynb. For all gene / barcode / mutation combinations, we tally up the number of sequences with mutations, both among cells that do and do not have complete sequencing of the viral genomes. We see below that substititutions are by far the most common type of mutation, then deletions, then insertions.
genotypes %>%
filter(barcoded_gene_present & ! is.na(mutations)) %>%
mutate(has_mutation=mutations != "WT") %>%
group_by(complete, mutation_type) %>%
summarize(wildtype=sum(! has_mutation), mutated=sum(has_mutation)) %>%
mutate(frac=mutated / wildtype) %>%
arrange(-complete, -mutated)
We create a gene_state variable that indicates if each barcoded gene variant is wildtype, absent, mutated, or has an unknown sequence:
genotypes <- genotypes %>%
mutate(gene_state=
ifelse(! barcoded_gene_present, "absent",
ifelse(missing_PacBio, "unknown",
ifelse(mutations == "WT", "wildtype", "mutated")))) %>%
transform(gene_state=factor(gene_state,
c("wildtype", "mutated", "absent", "unknown")))
We now want to plot and print the complete genotypes of all cells for which we could call all genes that are present.
This plot is fairly complex, so we have to do a lot of data cleaning to make it.
First, we number the cells with complete genomes (to provide succinct labels) in the genotypes_complete data frame, and arrange by the amount of viral mRNA:
genotypes_complete <- genotypes %>%
filter(complete) %>%
arrange(-frac_mRNA_from_flu) %>%
separate(CellBarcode, "CellBarcode_only", extra="drop", remove=FALSE) %>%
mutate(cell=paste("cell", match(CellBarcode_only, unique(CellBarcode_only)))) %>%
transform(cell=factor(cell, rev(unique(cell))))
Annotate each cell / gene by the number of viral barcodes for it (0 if gene not present, 1 if one barcoded variant, 2 if both barcoded variants):
n_viral_variants_df <- genotypes_complete %>%
filter(complete) %>%
select(cell, flu_gene, viral_barcode, barcoded_gene_present) %>%
distinct %>%
spread(viral_barcode, barcoded_gene_present) %>%
mutate(n_viral_variants=factor(as.integer(wt) + as.integer(syn)))
genotypes_complete <- merge(genotypes_complete,
n_viral_variants_df,
by=c('cell', 'flu_gene'))
Create a dataframe (genotypes_by_gene) that aggregates the data for viral barcodes for each gene and cell. In this dataframe, we no longer track the wildtype and synonymous barcodes separately, but rather simply track what is the genotype of all genes for that virus in that cell.
# get genotypes by gene (aggregating mutation types and viral barcodes)
genotypes_by_gene <- genotypes_complete %>%
dplyr::rename(cell_barcode=CellBarcode_only) %>%
# select just columns we care about
select(cell, cell_barcode, frac_mRNA_from_flu, IFN_frac, IFN_state,
coinfection_state, flu_gene, viral_barcode, mutation_type, mutations,
barcoded_gene_present, n_viral_variants) %>%
# annotate mutations into string
mutate(mutations=recode(mutations, "WT" = "")) %>%
spread(mutation_type, mutations) %>%
mutate(mutation_str=ifelse(! barcoded_gene_present,
paste(flu_gene, "missing", sep='-'),
ifelse(is.na(substitutions) | is.na(insertions) | is.na(deletions), "unknown",
ifelse(substitutions == "" & insertions == "" & deletions == "",
paste(flu_gene, "WT", sep='-'),
paste(substitutions, insertions, deletions))))) %>%
select(-barcoded_gene_present, -substitutions, -insertions, -deletions) %>%
verify(mutation_str != "unknown") %>%
# aggregate both viral barcodes and label mutations with long and short str
spread(viral_barcode, mutation_str) %>%
mutate(mutation_str_long=ifelse(syn == wt, wt,
ifelse(grepl("missing", syn), wt, ifelse(grepl("missing", wt), syn,
paste(wt, syn, sep=" / "))))) %>%
mutate(mutation_str_long=ifelse(mutation_str_long == paste(flu_gene, "WT", sep='-'),
"", mutation_str_long)) %>%
mutate(mutation_str_short=gsub("_\\(\\d+/\\d+\\)", "", mutation_str_long))
In preparation for plotting, create a data frame (flu_seq_coords) that gives the coordinates of all sites in the flu genome if we treat them as one continuous chromosome demarcated by gene boundaries. We do this reading in the vRNA lengths from ./data/flu_sequences/flu-wsn.fasta, and arrange the genes from longest to shortest as this is the convention for numbering flu gene segments:
# read in flu genes
flu_seqs <- readDNAStringSet("data/flu_sequences/flu-wsn.fasta")
# get start / end coords of genes
flu_seq_coords <- data.frame(flu_gene=names(flu_seqs), gene_length=width(flu_seqs)) %>%
separate(flu_gene, "flu_gene") %>%
transform(flu_gene=factor(flu_gene, flugenes)) %>%
mutate(gene_end=cumsum(gene_length), gene_start=gene_end - gene_length + 1,
gene_center=(gene_start + gene_end) / 2)
flu_seq_coords
Merge the genotypes_by_gene data frame with the sequence coordinates to create a new data frame (genotypes_by_plot) that gives the genotypic information in the coordinates used for the plotting below:
genotypes_to_plot <- genotypes_by_gene %>%
filter(! grepl("missing", mutation_str_short)) %>%
merge(flu_seq_coords, by="flu_gene") %>%
transform(IFN_state=factor(IFN_state))
Get the coordinates of substitution mutations into a data frame (substitution_coords) for use in the plotting below. We also get the frequency of each mutation in the PacBio CCSs so we can use that to adjust the plotting size:
substitution_coords <- genotypes_complete %>%
filter(barcoded_gene_present &
(! is.na(mutations)) &
mutations != "WT" &
mutation_type == "substitutions") %>%
mutate(mutations=strsplit(mutations, " ")) %>%
unnest(mutations) %>%
mutate(mutations=str_replace(mutations, paste0(flu_gene, '-'), '')) %>%
tidyr::extract(mutations,
c("site", "aa_subs", "freq_numerator", "freq_denominator"),
'[ACGT](\\d+)[ACTG]_?(.*)_\\((\\d+)/(\\d+)\\)',
remove=FALSE) %>%
mutate(aa_subs=aa_subs %>%
str_replace_all("flu", "") %>%
str_replace_all("_", " ") %>%
str_replace_all(c(Ala="A", Arg="R", Asn="N", Asp="D", Cys="C", Glu="E", Gln="Q",
Gly="G", His="H", Ile="I", Leu="L", Lys="K", Met="M", Phe="F",
Pro="P", Ser="S", Thr="T", Trp="W", Tyr="Y", Val="V"))
) %>%
mutate(
substitution_type=ifelse(nchar(aa_subs), "nonsynonymous", "synonymous"),
frequency=as.numeric(freq_numerator) / as.numeric(freq_denominator)
) %>%
merge(flu_seq_coords, by="flu_gene") %>%
mutate(site=gene_start + as.integer(site) - 1) %>%
select(cell, flu_gene, IFN_state, viral_barcode, n_viral_variants, site,
aa_subs, substitution_type, gene_start, gene_end, frequency, mutations)
Get the coordinates of deletion mutations into a data frame (deletion_coords) for use in the plotting below. Importantly, deletions are shown at widths representing their actual size unless they are very small, because in that case they are two narrow to see. So very small deletions are "padded" which makes them no longer at actual size (they are slightly larger), but at least visible.
deletion_coords <- genotypes_complete %>%
filter(barcoded_gene_present &
(! is.na(mutations)) &
mutations != "WT" &
mutation_type == "deletions") %>%
mutate(mutations=strsplit(mutations, " ")) %>%
unnest(mutations) %>%
mutate(mutations=str_replace(mutations, paste0(flu_gene, '-'), '')) %>%
tidyr::extract(mutations,
c("del_start", "del_end", "freq_numerator", "freq_denominator"),
'del(\\d+)to(\\d+)_\\((\\d+)/(\\d+)\\)',
remove=FALSE) %>%
merge(flu_seq_coords, by="flu_gene") %>%
mutate(del_start=as.integer(del_start) + gene_start - 1,
del_end=as.integer(del_end) + gene_start - 1) %>%
mutate(
mutation_type='deletion',
frequency=as.numeric(freq_numerator) / as.numeric(freq_denominator)
) %>%
select(cell, flu_gene, viral_barcode, n_viral_variants, IFN_state,
del_start, del_end, mutation_type, frequency, mutations)
# pad small deletions
min_width <- 15 # minimum width at which a deletion is shown
deletion_coords <- deletion_coords %>%
mutate(del_len=del_end - del_start,
del_padding=max(0, min_width - del_len),
del_start=del_start - del_padding / 2,
del_end=del_end + del_padding / 2)
The plotting is not currently set up to handle insertions since we do not have any in the current data set. The next cell throws an error and stops the notebook if an insertion is present. If we end up identifying an insertion in one of the genes, this cell and the plotting that follows it will need to be adjusted:
insertion_coords <- genotypes_complete %>%
filter(barcoded_gene_present &
(! is.na(mutations)) &
mutations != "WT" &
mutation_type == "insertions") %>%
mutate(mutations=strsplit(mutations, " ")) %>%
unnest(mutations) %>%
mutate(mutations=str_replace(mutations, paste0(flu_gene, '-'), '')) %>%
tidyr::extract(mutations,
c("ins_site", "ins_len", "freq_numerator", "freq_denominator"),
'ins(\\d+)len(\\d+)_\\((\\d+)/(\\d+)\\)',
remove=FALSE) %>%
merge(flu_seq_coords, by="flu_gene") %>%
mutate(ins_site=as.integer(ins_site) + gene_start - 1,
ins_len=as.integer(ins_len)) %>%
mutate(mutation_type='insertion',
frequency=as.numeric(freq_numerator) / as.numeric(freq_denominator)
) %>%
select(cell, flu_gene, viral_barcode, n_viral_variants, IFN_state,
ins_site, ins_len, mutation_type, frequency)
# pad small insertions
min_width <- 15 # minimum width at which an insertion is shown
insertion_coords <- insertion_coords %>%
mutate(ins_len=max(ins_len, min_width))
Now use gggenes to make a plot that visually displays the genotypes for all cells for which these are called. We also indicate the amount of total mRNA in the cell that comes from virus and IFN. This is a complex plot, so the code that creates it is fairly long.
First, write a function that plots the genotypes given a few arguments that will subsequently be customized:
arrow_height <- 0.3 # vertical space per arrow in inches
arrow_frac_height <- 0.55 # arrow takes up this much of available height
max_shape_size <- 4.6 # max size of substitutions, deletions, insertions
sub_size <- 0.55 # size of substitutions relative to `max_shape_size`
del_size <- 1 # size deletions relative to `max_shape_size`
ins_size <- del_size # size of segments for insertions
# Function to plot genotypes filtering for `col_name` %in% `col_val`.
# Expand y limits to make height for `ncells` cells.
# Add title `title`.
# Returns named list with `percent_flu`, `percent_IFN`, `genotypes`
plot_genotypes <- function(col_name, col_val, ncells=NULL, title='') {
# filter for cells of interest
p_genotypes_to_plot <- genotypes_to_plot %>%
filter(UQ(as.name(col_name)) %in% col_val)
p_substitution_coords <- substitution_coords %>%
filter(UQ(as.name(col_name)) %in% col_val)
p_deletion_coords <- deletion_coords %>%
filter(UQ(as.name(col_name)) %in% col_val)
p_insertion_coords <- insertion_coords %>%
filter(UQ(as.name(col_name)) %in% col_val)
# get number of cells if not specified
p_ncells <- p_genotypes_to_plot$cell %>% unique %>% length
if (is.null(ncells))
ncells <- p_ncells
# make space for title
if (nchar(title)) {
ncells <- ncells + 2
}
# make plot of genotypes
p_genes <- ggplot() +
# plot arrows for genes
geom_gene_arrow(data=p_genotypes_to_plot,
aes(xmin=gene_start, xmax=gene_end, y=cell, fill=n_viral_variants),
arrow_body_height=unit(arrow_frac_height * arrow_height, 'in'),
arrowhead_height=unit(arrow_frac_height * arrow_height, 'in'),
arrowhead_width=unit(0.3 * arrow_height, 'in'),
size=0.4) +
scale_fill_manual(values=rep(c(cbPalette[[3]], cbPalette[[6]]), 4),
name="viral barcodes") +
# plot points for substitutions
geom_point(data=p_substitution_coords %>%
filter(n_viral_variants == 1),
aes(x=site, y=cell, shape=substitution_type,
size=frequency * sub_size),
color=cbPalette[[7]], stroke=0.8) +
geom_point(data=p_substitution_coords %>%
filter(n_viral_variants == 2 & viral_barcode == "wt"),
aes(x=site, y=cell, shape=substitution_type,
size=0.5 * frequency * sub_size),
color=cbPalette[[7]], stroke=0.8,
position=position_nudge(y=0.4 * arrow_height)) +
geom_point(data=p_substitution_coords %>%
filter(n_viral_variants == 2 & viral_barcode == "syn"),
aes(x=site, y=cell, shape=substitution_type,
size=0.5 * frequency * sub_size),
color=cbPalette[[7]], stroke=1,
position=position_nudge(y=-0.4 * arrow_height)) +
scale_shape_manual(values=c(19, 1), name="point mutation") +
scale_size(range=c(0, max_shape_size)) +
guides(shape=guide_legend(override.aes=list(size=max_shape_size * sub_size)),
size='none') +
# plot lines for deletions
geom_segment(data=p_deletion_coords %>%
filter(n_viral_variants == 1),
aes(x=del_start, y=cell, xend=del_end, yend=cell,
color=mutation_type, size=frequency * del_size)) +
geom_segment(data=p_deletion_coords %>%
filter(n_viral_variants == 2 & viral_barcode == "wt"),
aes(x=del_start, y=cell, xend=del_end, yend=cell,
color=mutation_type, size=0.38 * frequency * del_size),
position=position_nudge(y=0.4 * arrow_height)) +
geom_segment(data=p_deletion_coords %>%
filter(n_viral_variants == 2 & viral_barcode == "syn"),
aes(x=del_start, y=cell, xend=del_end, yend=cell,
color=mutation_type, size=0.38 * frequency * del_size),
position=position_nudge(y=-0.4 * arrow_height)) +
scale_color_manual(values=c(cbPalette[[5]], cbPalette[[8]]),
name=NULL) +
guides(color=guide_legend(
override.aes=list(size=del_size * max_shape_size))) +
# plot lines for insertions
geom_segment(data=p_insertion_coords %>%
filter(n_viral_variants == 1),
aes(x=ins_site, y=cell, xend=ins_site + ins_len, yend=cell,
color=mutation_type, size=frequency * ins_size)) +
geom_segment(data=p_insertion_coords %>%
filter(n_viral_variants == 2 & viral_barcode == "wt"),
aes(x=ins_site, y=cell, xend=ins_site + ins_len, yend=cell,
color=mutation_type, size=0.38 * frequency * ins_size),
position=position_nudge(y=0.4 * arrow_height)) +
geom_segment(data=p_insertion_coords %>%
filter(n_viral_variants == 2 & viral_barcode == "syn"),
aes(x=ins_site, y=cell, xend=ins_site + ins_len, yend=cell,
color=mutation_type, size=0.38 * frequency * ins_size),
position=position_nudge(y=-0.4 * arrow_height)) +
# label amino-acid substitutions
geom_gene_label(data=p_substitution_coords %>%
group_by(cell, flu_gene, gene_start, gene_end) %>%
summarize(aa_subs=paste(aa_subs, collapse= " ")),
aes(label=aa_subs, xmin=gene_start, xmax=gene_end, y=cell),
size=6, min.size=2, reflow=TRUE, color='white',
padding.y=unit(0.02 * arrow_height, "in")) +
# other plot formatting
theme(axis.ticks.y=element_blank(), axis.line.y=element_blank(),
axis.text.y=element_blank(),
panel.grid.major.y=ggplot2::element_line(colour = "grey", size = 1),
plot.margin=unit(arrow_height * c(0, 1.2, 0.1, 0.4), "in")) +
scale_x_continuous(expand=c(0.001, 0), breaks=flu_seq_coords$gene_center,
labels=gsub("flu", "", flugenes)) +
expand_limits(y=c(0, ncells)) +
ylab(NULL) +
xlab("viral gene")
# add title
if (nchar(title))
p_genes <- p_genes + annotate('text',
x=(min(flu_seq_coords$gene_start) + max(flu_seq_coords$gene_end)) / 2,
y=p_ncells + 1, label=title, size=6)
# make plot of % flu
p_flu <- ggplot(p_genotypes_to_plot %>%
mutate(x="flu", percent=frac_mRNA_from_flu * 100),
aes(x=x, y=cell, fill=percent, label=sprintf("%.0f", percent))) +
geom_tile(color=cbPalette[[4]], height=arrow_frac_height, width=0.95, size=0.5) +
geom_text(size=3) +
scale_fill_gradient(low='white', high=cbPalette[[4]],
limits=c(0, 100 * max(genotypes_to_plot$frac_mRNA_from_flu))) +
expand_limits(y=c(0, ncells)) +
xlab("% mRNA") +
theme(axis.title.x=element_text(hjust=0.2)) +
ylab(NULL) +
theme(legend.position='none', axis.line.y=element_blank(),
axis.ticks.y=element_blank(),
plot.margin=unit(arrow_height * c(0, 0, 0, 0.3), "in"))
# make plot of % IFN
p_IFN <- ggplot(p_genotypes_to_plot %>%
mutate(x="IFN", percent=IFN_frac * 100),
aes(x=x, y=cell, fill=percent, color=IFN_state, label=sprintf("%.1f", percent))) +
geom_tile(height=arrow_frac_height, width=0.95, size=0.5) +
geom_text(size=3, color=cbbPalette[1]) +
scale_fill_gradient(low='white', high=cbPalette[[2]],
limits=c(0, 100 * max(genotypes_to_plot$IFN_frac))) +
scale_color_manual(values=c(cbPalette[1], cbPalette[2]), drop=FALSE) +
expand_limits(y=c(0, ncells)) +
xlab(NULL) +
ylab(NULL) +
theme(legend.position='none', axis.line.y=element_blank(),
axis.text.y=element_blank(), axis.ticks.y=element_blank(),
plot.margin=unit(arrow_height * c(0, 0, 0, 0), "in"))
return(list(percent_flu=p_flu, percent_IFN=p_IFN, genotypes=p_genes))
}
First make a plot that shows all cells wrapped into a specified number of columns:
ncols <- 2
cell_names <- genotypes_to_plot$cell %>% unique %>% sort(decreasing=TRUE)
ncells <- cell_names %>% length
ncells_per_col <- ceiling(ncells / ncols)
p_cells_list <- lapply(
1:ncols,
function (i) {
col_val <- cell_names[((i - 1) * ncells_per_col + 1):
min(ncells, (i * ncells_per_col))]
plot_genotypes('cell', col_val, ncells_per_col, '')
}
)
p_genotypes <- plot_grid(
plotlist=
append(lapply(1:ncols,
function (i) plot_grid(
plotlist=lapply(c('percent_flu', 'percent_IFN', 'genotypes'),
function (p) p_cells_list[[i]][[p]] + theme(legend.position='none')
),
align='h',
rel_widths=c(0.115, 0.039, 1),
nrow=1
)
),
list(get_legend(p_cells_list[[ncols]]['genotypes']))
),
rel_widths=c(rep(1, ncols), 0.2),
nrow=1
)
saveShowPlot(p_genotypes,
width=9 * ncols + 2,
height=arrow_height * (2 + ncells_per_col),
isfig=TRUE)
Now make a plot that splits into two columns, one for IFN+ and one for IFN- cells:
ncells_per_col_ifn <- max(
genotypes_to_plot %>% filter(IFN_state == 'IFN+') %$%
cell %>% unique %>% length,
genotypes_to_plot %>% filter(IFN_state == 'IFN-') %$%
cell %>% unique %>% length)
p_ifn_pos <- plot_genotypes("IFN_state", c("IFN+"), ncells_per_col, "")
p_ifn_neg <- plot_genotypes("IFN_state", c("IFN-"), ncells_per_col, "")
p_genotypes_by_ifn <- plot_grid(
plot_grid(
p_ifn_neg[['percent_flu']],
p_ifn_neg[['percent_IFN']],
p_ifn_neg[['genotypes']] + theme(legend.position='none'),
align='h',
rel_widths=c(0.115, 0.039, 1),
nrow=1
),
plot_grid(
p_ifn_pos[['percent_flu']],
p_ifn_pos[['percent_IFN']],
p_ifn_pos[['genotypes']] + theme(legend.position='none'),
align='h',
rel_widths=c(0.115, 0.039, 1),
nrow=1
),
get_legend(p_ifn_neg[['genotypes']]),
rel_widths=c(1, 1, 0.2),
nrow=1
)
saveShowPlot(p_genotypes_by_ifn, width=20,
height=arrow_height * (3 + ncells_per_col),
isfig=TRUE)
The plots above summarize essentially all of the genotypic data. They look better if you open them and view at higher resolution.
We also have a table that gives the same information displayed in the plot above. However, it also has a few pieces of additional relevant data: each mutation is suffixed by the number of PacBio CCSs in which it is observed, and the exact identities of all mutations are shown. Below we show the first few lines, and then write the table to a file for the paper:
genotype_table <- genotypes_by_gene %>%
arrange(flu_gene) %>%
group_by(cell, cell_barcode, frac_mRNA_from_flu, IFN_frac, IFN_state, coinfection_state) %>%
summarize(mutations=paste0(trimws(mutation_str_long), collapse="; "),
mutations=gsub('(; )+', '; ', mutations),
mutations=gsub('; $', '', mutations),
mutations=gsub('^; ', '', mutations)) %>%
arrange(desc(cell))
genotype_table %>% head
genotype_table %>% write.csv(file.path(figsdir, 'genotypes.csv'), row.names=FALSE)
Even more extensive information (for instance, is the wt or syn viral barcode variant present in each cell) is in the full genotypes_by_gene data frame. That is too long to print in this notebook, but the cell below shows the first few lines as an illustration:
genotypes_by_gene %>%
select(-n_viral_variants, -mutation_str_short) %>%
arrange(desc(cell)) %>%
head
Finally, although we think the two viral barcodes are equivalent, we just confirm that among the IFN+ cells infected with wildtype virus, we don't see any bias towards one viral barcode. The analysis below confirms that there is no such bias: among IFN+ cells infected with wildtype, 3 have the wildtype viral variant, 4 have the synonymous viral variant, and 2 are co-infected:
genotypes_by_gene %>%
mutate(has_syn=! str_detect(syn, "missing"),
has_wt=! str_detect(wt, "missing")) %>%
group_by(cell, IFN_state) %>%
summarize(
has_syn=any(has_syn),
has_wt=any(has_wt),
mutations=paste0(trimws(mutation_str_long), collapse=""),
is_wildtype=('' == mutations)
) %>%
group_by(IFN_state, has_syn, has_wt, is_wildtype) %>%
summarize(ncells=n()) %>%
arrange(IFN_state, is_wildtype, has_syn, has_wt)
Examine the genetic features of viruses (gene absence, mutations) that associated with IFN expression.
Get a data frame (infection_data) that has all information of potential interest about each infected cell, and then has a row for each flu gene and mutation type (amino-acid substitutions, insertions, deletions, gene absence) that has a column has_mutation that is:
# data frame annotated with all mutation types
infection_data <- genotypes %>%
# get call barcode without suffix
separate(CellBarcode, "cell_barcode", extra="drop") %>%
# call mutation status, disregard synonymous mutations
tidyr::extract(mutations, c("aa_subs"), paste0('[ACGT]\\d+[ACTG]_',
'([:alnum:]+\\-+[:upper:][:lower:]{2}\\d+[:upper:][:lower:]{2})_\\(\\d+/\\d+\\)'),
remove=FALSE) %>%
mutate(has_mutation=
ifelse(! barcoded_gene_present, FALSE,
ifelse(is.na(mutations), NA,
ifelse(mutations == "WT", FALSE,
ifelse(mutation_type != "substitutions", TRUE,
ifelse(mutation_type == "substitutions" & is.na(aa_subs), FALSE, TRUE)))))) %>%
# get columns of interest
select(cell_barcode, total_mRNAs, frac_mRNA_from_flu,
sapply(flugenes, function (g) paste0(g, "_frac")) %>% as.vector,
sapply(flugenes, function (g) paste0("has_", g)) %>% as.vector,
n_flu_genes, ISG_frac, ISG_state, IFN_frac, IFN_state,
coinfection_state, complete_state, mutation_type, flu_gene,
viral_barcode, has_mutation, barcoded_gene_present) %>%
# aggregate data for both viral barcodes into gene_status
group_by(cell_barcode, flu_gene, mutation_type) %>%
mutate(has_mutation=
ifelse(any(is.na(has_mutation)), NA,
ifelse(any(has_mutation), TRUE, FALSE))) %>%
# gene presence / absence aggregated across viral barcodes
group_by(cell_barcode, flu_gene) %>%
mutate(gene_present=any(barcoded_gene_present)) %>%
select(-barcoded_gene_present, -viral_barcode) %>%
unique %>%
ungroup
# add gene absence as a mutation type
infection_data <- rbind(
infection_data,
infection_data %>%
mutate(mutation_type="gene absent",
has_mutation=!gene_present) %>%
unique
)
# indicate cells that are infected by fully wildtype virus
infection_data <- infection_data %>%
group_by(cell_barcode) %>%
mutate(wildtype_virus=ifelse(any(is.na(has_mutation)) | any(has_mutation),
"incomplete / mutated", "full wildtype")) %>%
ungroup
Write data frame to a file for paper. This data frame has information on mutation presence with a row for each gene and mutation type:
infection_data %>% head
infection_data %>% write.csv(file.path(figsdir, 'mutations.csv'), row.names=FALSE)
Then get a smaller data frame (cell_infection_data) that just has the cell-specific data in infection_data with one cell per row:
cell_infection_data <- infection_data %>%
select(-mutation_type, -flu_gene, -has_mutation, -gene_present) %>%
unique
We've already shown that expression of flu mRNA is highly heterogeneous across infected cells. Is part of the heterogeneity due to viral genetics? To assess this, we plot the distribution of viral mRNA among cells for which we obtained complete viral genomes, faceting by whether those cells are infected by fully wildtype virus. We quantify the heterogeneity in flu mRNA production by the Gini coefficient, which we here refer to using the alternative nomenclature of Gini index since it is shorter and fits on the plot better. Note that these plots are already conditioned on us obtaining a full genome of the infecting virus (which preferentially happens for cells with lots of viral mRNA), so this already reduced the heterogeneity below that observed in Russell et al (2018).
Nonetheless, conditioning on whether we have a full viral genome, the plots below shows that cells infected by wildtype viruses have less heterogeneity than cells infected by viruses with mutations / indel, indicating that viral genetics contributes to the heterogeneity. The plots show the 95% confidence intervals for the Gini coefficients:
# data frame with complete infections and Gini coefficients on frac flu
frac_flu_df <- cell_infection_data %>%
filter(complete_state == "complete") %>%
group_by(wildtype_virus) %>%
mutate(gini=Gini(frac_mRNA_from_flu, conf.level=0.95) %>%
(function (x) sprintf("%.2f (%.2f, %.2f)", x[1], x[2], x[3]))) %>%
ungroup
# make histogram
p_frac_flu_wt_hist <- ggplot(
frac_flu_df,
aes(frac_mRNA_from_flu)) +
geom_histogram(bins=20, fill=cbPalette[3]) +
facet_wrap(~ wildtype_virus, ncol=1) +
xlab("frac mRNA from flu") +
scale_y_continuous(name="number of cells", breaks=pretty_breaks(4)) +
expand_limits(y=c(0, 14)) +
theme(strip.text=element_text(size=14, margin=margin(b=3)))
# annotate with Gini coefficients
p_frac_flu_wt <- ggdraw() +
draw_plot(p_frac_flu_wt_hist, 0, 0, 1, 1) +
draw_label(
paste('Gini index',
frac_flu_df %>%
filter(wildtype_virus != "full wildtype") %>%
head(n=1) %$%
gini),
x=0.97, y=0.47, size=12, hjust=1, fontface='italic') +
draw_label(
paste('Gini index',
frac_flu_df %>%
filter(wildtype_virus == "full wildtype") %>%
head(n=1) %$%
gini),
x=.97, y=0.89, size=12, hjust=1, fontface='italic')
saveShowPlot(p_frac_flu_wt, 3, 3.8)
We now look to see if IFN induction is less common in cells infected with wildtype virus. We will then do a similar analysis with other variables, such as ISG levels and coinfection. So we create a general function that does the analysis by making an annotated histogram, generating a contingency table, and performing a Fisher's exact test.
# Plot faceted histogram for viruses with complete sequenced genomes,
# with numerical annotations on histogram.
# Returns list of plot, contingeny table, and Fisher's exact test result.
# Designed to show how IFN or ISG levels depend on a faceting variable.
# `ifn_or_isg`: plot IFN or ISG levels on x-axis
# `facet_var`: variable to facet on, `wildtype_virus` or `coinfection_state`
plot_hist <- function (ifn_or_isg, facet_var) {
df <- cell_infection_data %>%
filter(complete_state == "complete") %>%
mutate(frac=if (ifn_or_isg == "IFN") IFN_frac else ISG_frac,
state=if (ifn_or_isg == "IFN") IFN_state else ISG_state,
facet_var=if (facet_var == "wildtype_virus") wildtype_virus
else coinfection_state)
p_hist <- ggplot(
df,
aes(frac, fill=state)) +
geom_histogram(bins=42) +
facet_wrap(~ facet_var, ncol=1) +
xlab(paste("frac mRNA from", ifn_or_isg)) +
scale_y_continuous(name="number of cells", breaks=pretty_breaks(4)) +
scale_fill_manual(
values=if (ifn_or_isg == "IFN") rev(cbPalette[2:3]) else
c(cbPalette[3], cbPalette[7]),
name=NULL) +
theme(strip.text=element_text(size=14, margin=margin(b=3)),
legend.position=c(0.65, 0.89))
facet_vals <- df %>% arrange(facet_var) %$% facet_var %>% unique
# annotate with fractions
p <- ggdraw() +
draw_plot(p_hist, 0, 0, 1, 1) +
draw_label(
paste(df %>%
filter(facet_var != facet_vals[[1]] &
state == paste0(ifn_or_isg, "+")) %>%
nrow,
"of",
df %>%
filter(facet_var != facet_vals[[1]]) %>%
nrow,
paste0(ifn_or_isg, "+")),
x=0.3, y=0.47, size=12, hjust=0, fontface='italic') +
draw_label(
paste(df %>%
filter(facet_var == facet_vals[[1]] &
state == paste0(ifn_or_isg, "+")) %>%
nrow,
"of",
df %>%
filter(facet_var == facet_vals[[1]]) %>%
nrow,
paste0(ifn_or_isg, "+")),
x=0.3, y=0.89, size=12, hjust=0, fontface='italic')
contingency_table <- df %>%
group_by(facet_var, state) %>%
summarize(n=n()) %>%
spread(key=facet_var, value=n)
test <- fisher.test(contingency_table %>% select(-state))
return (list(p, contingency_table, test))
}
Now we look at the association of a cell being infected with full wildtype virus and being IFN+:
# perform analysis
p_frac_ifn_wt_list <- plot_hist(ifn_or_isg="IFN", facet_var="wildtype_virus")
# get and show histogram
p_frac_ifn_wt <- p_frac_ifn_wt_list[[1]]
saveShowPlot(p_frac_ifn_wt, 3, 3.8)
# show contingency table and Fisher's exact test result
p_frac_ifn_wt_list[[2]]
p_frac_ifn_wt_list[[3]]
We see that cells infected with wildtype virus are less likely to be IFN+, and that when they are they tend to express less IFN.
However, the difference in the number of IFN+ cells among those infected with wildtype virus is not statistically significant, probably because the numbers are so small.
We next repeat the above analysis to see if known dual-barcode coinfections are associated with IFN induction. The plots below show that known coinfected cells are just about as likely to be IFN+ as cells that are infected with just one viral barcode. Again, the caveat to this is that we only identify about half of the coinfections.
# perform analysis
p_frac_ifn_coinfect_list <- plot_hist(ifn_or_isg="IFN", facet_var="coinfection_state")
# get and show histogram
p_frac_ifn_coinfect <- p_frac_ifn_coinfect_list[[1]]
saveShowPlot(p_frac_ifn_coinfect, 3, 3.8)
# show contingency table and Fisher's exact test result
p_frac_ifn_coinfect_list[[2]]
p_frac_ifn_coinfect_list[[3]]
Now we check to see if the absence of any genes or mutations in any genes are associated with IFN induction. We use a Fisher's exact test to see if any genes are absent more in the IFN+ cells than the IFN- cells, and then use an FDR correction for multiple hypotheses.
For this analysis, we look at:
For each type of mutation, we calculate the percentage of cells that have that mutation in their gene in both the IFN+ and IFN- condition. For substitutions and deletions, we condition this analysis on the cell having the gene (i.e., we exclude from the calculations cells that are completely lacking the gene).
We also use Fisher's exact test to calculate the P-value that there is a difference in the frequency of the mutation among the IFN+ and IFN- cells.
First, we define a function that makes these plots. We generalize this function so that it can show varying levels of annotations and work for IFN or ISGs:
# function to plot association of mutations with IFN or ISG
# `complete_only`: only show cells with complete genomes sequenced
# `show_counts`: show numerical counts on plot
# `showQvalue`: show Q-values in addition to P-values
# `ifn_or_isg`: perform plotting for IFN or ISG?
plot_muts <- function (complete_only, show_counts, showQvalue, ifn_or_isg) {
mutation_data <- infection_data %>%
filter(mutation_type != "insertions") %>%
transform(
mutation_type=factor(mutation_type, c("gene absent", "deletions", "substitutions")),
flu_gene=factor(gsub("flu", "", flu_gene), flugenes_noprefix)
) %>%
filter(gene_present | mutation_type == "gene absent") %>%
filter(complete_state == "complete" | ! complete_only) %>%
filter(! is.na(has_mutation)) %>%
mutate(state=if (ifn_or_isg == "IFN") IFN_state else ISG_state) %>%
group_by(mutation_type, flu_gene) %>%
summarize(
pos_mutated=sum(has_mutation & state == paste0(ifn_or_isg, "+")),
neg_mutated=sum(has_mutation & state == paste0(ifn_or_isg, "-")),
pos=sum(state == paste0(ifn_or_isg, "+")),
neg=sum(state == paste0(ifn_or_isg, "-")),
pos_percent=100 * pos_mutated / pos,
neg_percent=100 * neg_mutated / neg,
P=fisher.test(
matrix(c(pos_mutated, pos - pos_mutated,
neg_mutated, neg - neg_mutated),
nrow=2),
)[["p.value"]],
) %>%
ungroup %>%
gather("state", "percent_mutated", pos_percent, neg_percent) %>%
mutate(state=str_replace(state, "pos_percent", "+"),
state=str_replace(state, "neg_percent", "-"),
denom=ifelse(state == "+", pos, neg),
num=ifelse(state == "+", pos_mutated, neg_mutated),
label=sprintf("%d /\n%d", num, denom),
Q=qvalue(P)$qvalue,
Plabel=if (showQvalue)
paste0(signif(P, 1), " / ", signif(Q, 1))
else signif(P, 1)
)
ymax <- mutation_data$percent_mutated %>% max
p <- ggplot(
mutation_data,
aes(state, percent_mutated, fill=state)) +
geom_bar(stat='identity') +
geom_text(aes(label=Plabel),
y=1.09 * ymax, x=1.5, fontface="italic", vjust=0, size=3) +
geom_segment(x=1, xend=2, y=1.05 * ymax, yend=1.05 * ymax, size=0.25) +
facet_grid(mutation_type ~ flu_gene) +
scale_fill_manual(values=
if (ifn_or_isg == "IFN") rev(cbPalette[2:3]) else
c(cbPalette[3], cbPalette[7])) +
theme(
legend.position='none',
strip.text=element_text(size=14, margin=margin(l=4, r=1)),
axis.text.x=element_text(size=14)
) +
expand_limits(y=c(0, 1.2 * ymax)) +
xlab(paste(ifn_or_isg, "state")) +
ylab("percent of cells with mutation")
if (show_counts)
p <- p + geom_text(aes(label=label),
y=0, vjust=-0.1, hjust=0.5, fontface="italic", size=3)
return (p)
}
Now make a plot showing just the cells for which we could completely sequence the infection virions. This plot shows that absence of NS is strongly associated with IFN induction, that amino-acid substitutions in PB1 are modestly corelated with IFN+, and that deletions in HA and substitutions in NS are perhaps weakly associated. The numbers of above the plots show the P value and the Q value. Only the NS absence association remains significant at an FDR of 0.1.
p_muts_ifn_complete <- plot_muts(
complete_only=TRUE, show_counts=FALSE,
showQvalue=TRUE, ifn_or_isg="IFN")
saveShowPlot(p_muts_ifn_complete, width=7, height=4.75)
Now make a version of the same plot where we no longer require the genomes of the infecting virions to be completely sequenced, but also include genes from partially sequenced virions. This will include more cells with low viral burden, as those are the ones for which we often don't sequence the full infecting virus.
The trends here remain largely similar to above, although there is now a borderline trend for IFN+ to be depleted in absence (enriched in presence) of PB2 and PA:
p_muts_ifn_all <- plot_muts(
complete_only=FALSE, show_counts=FALSE,
showQvalue=TRUE, ifn_or_isg="IFN")
saveShowPlot(p_muts_ifn_all, width=7, height=4.75, isfig=TRUE)
Next we check if the amount of viral mRNA is associated with IFN induction. Since lack of NS is a major determinant of IFN positivity, we do this faceting by NS presence. We see that among NS-expressing infected cells there is no association with the amount of flu and IFN induction. But among NS-deficient cells, there is a significant assocation between high amounts of viral mRNA and IFN induction.
ymax <- cell_infection_data$frac_mRNA_from_flu %>% max
p_viral_load_ifn <- ggplot(
cell_infection_data %>%
mutate(NS_state=ifelse(has_fluNS, "has NS", "lacks NS"),
IFN_state=gsub("IFN", "", IFN_state)),
aes(IFN_state, frac_mRNA_from_flu, fill=IFN_state)) +
geom_boxplot(outlier.size=1, outlier.alpha=0.4) +
stat_compare_means(
aes(label=..p.format..),
method="wilcox.test", hjust=0.5, vjust=0, label.x=1.5,
label.y=1.08 * ymax, fontface="italic"
) +
geom_segment(x=1, xend=2, y=1.04 * ymax, yend=1.04 * ymax, size=0.25) +
scale_fill_manual(values=rev(cbPalette[2:3])) +
theme(legend.position="none",
strip.text=element_text(size=14),
axis.text.x=element_text(size=14)) +
facet_wrap(~ NS_state, ncol=1) +
expand_limits(y=c(0, 1.17 * ymax)) +
scale_y_continuous(name="frac mRNA from flu") +
xlab("IFN state")
saveShowPlot(p_viral_load_ifn, width=2, height=3.8)
Now we parallel the analysis in the preceding section, but look for features that associate with ISG expression rather than IFN induction.
We see that similarly for IFN induction, ISG expression is more common and higher in cells that fail to express wildtype copies of all viral genes. The difference here is statisticall significant:
# perform analysis
p_frac_isg_wt_list <- plot_hist(ifn_or_isg="ISG", facet_var="wildtype_virus")
# get and show histogram
p_frac_isg_wt <- p_frac_isg_wt_list[[1]]
saveShowPlot(p_frac_isg_wt, 3, 3.8)
# show contingency table and Fisher's exact test result
p_frac_isg_wt_list[[2]]
p_frac_isg_wt_list[[3]]
As for IFN induction, there is no substantial association between known co-infections and ISG expression:
# perform analysis
p_frac_isg_coinfect_list <- plot_hist(ifn_or_isg="ISG", facet_var="coinfection_state")
# get and show histogram
p_frac_isg_coinfect <- p_frac_isg_coinfect_list[[1]]
saveShowPlot(p_frac_isg_coinfect, 3, 3.8)
# show contingency table and Fisher's exact test result
p_frac_isg_coinfect_list[[2]]
p_frac_isg_coinfect_list[[3]]
Most of the viral features that were associated with IFN induction are also associated with ISG expression, although on the absence of NS is significantly associated:
p_muts_isg_complete <- plot_muts(
complete_only=TRUE, show_counts=FALSE,
showQvalue=TRUE, ifn_or_isg="ISG")
saveShowPlot(p_muts_isg_complete, width=7, height=4.75)
Similar to above for IFN, cells that have NS exhibit no association between the amount of viral mRNA and ISG expression. But for cells that lack NS, there is a strong tendency for high-flu cells to have more ISG expression:
ymax <- cell_infection_data$frac_mRNA_from_flu %>% max
p_viral_load_isg <- ggplot(
cell_infection_data %>%
mutate(NS_state=ifelse(has_fluNS, "has NS", "lacks NS"),
ISG_state=gsub("ISG", "", ISG_state)),
aes(ISG_state, frac_mRNA_from_flu, fill=ISG_state)) +
geom_boxplot(outlier.size=1, outlier.alpha=0.4) +
stat_compare_means(
aes(label=..p.format..),
method="wilcox.test", hjust=0.5, vjust=0, label.x=1.5,
label.y=1.08 * ymax, fontface="italic"
) +
geom_segment(x=1, xend=2, y=1.04 * ymax, yend=1.04 * ymax, size=0.25) +
scale_fill_manual(values=c(cbPalette[3], cbPalette[7])) +
theme(legend.position="none",
strip.text=element_text(size=14),
axis.text.x=element_text(size=14)) +
facet_wrap(~ NS_state, ncol=1) +
expand_limits(y=c(0, 1.17 * ymax)) +
scale_y_continuous(name="frac mRNA from flu") +
xlab("ISG state")
saveShowPlot(p_viral_load_isg, width=2, height=3.8)
We have made all of the plots above, and saved some of them to the figures directory already by using the isfig=TRUE argument to saveShowPlot.
However, there are others that we want to assemble into multi-panel figures.
We do that here.
First, we assemble a figure that shows the calling of cells, infected cells, and IFN+ cells:
p_cell_summary <- plot_grid(
plot_grid(p_cellcounts, p_flu_vs_cell,
ncol=1, rel_heights=c(1, 1.5), scale=0.9,
labels=c("A", "B"), label_size=18, vjust=1),
plot_grid(p_frac_flu, labels="C", scale=0.95, label_size=18, vjust=1),
plot_grid(p_nflu_genes, labels="D", scale=0.95, label_size=18, vjust=1),
plot_grid(p_flu_rel_expr, p_coinfect,
scale=0.95, ncol=1,
labels=c("E", "F"), label_size=18, vjust=1),
plot_grid(p_ifn_dist, labels="G", scale=0.95, label_size=18, vjust=1),
nrow=1, scale=0.95, rel_widths=c(1, 0.7, 0.6, 0.75, 0.7), align='h'
) +
theme(plot.margin=unit(c(t=0, r=0, b=-0.3, l=0), "in"))
saveShowPlot(p_cell_summary, width=15.5, height=4.9, isfig=TRUE)
Next we assemble a figure that shows how viral mutations are associated with heterogeneity and IFN induction:
p_mutations <- plot_grid(
p_frac_flu_wt,
p_frac_ifn_wt,
p_muts_ifn_complete,
p_viral_load_ifn,
labels=c("A", "B", "C", "D"), label_size=18, vjust=1.2, hjust=-1,
nrow=1, scale=0.93, rel_widths=c(1, 1, 2.4, 0.6), align='h'
) +
theme(plot.margin=unit(c(t=0, r=0, b=-0.3, l=0), "in"))
saveShowPlot(p_mutations, width=15.5, height=4.9, isfig=TRUE)
Next assemble a merged figure supplement that shows how viral mutations are associated with ISG expression:
p_mutations_isg <- plot_grid(
p_frac_isg_wt,
p_muts_isg_complete,
p_viral_load_isg,
labels=c("A", "B", "C"), label_size=18, vjust=1.2, hjust=-1,
nrow=1, scale=0.93, rel_widths=c(1, 2.4, 0.6), align='h'
) +
theme(plot.margin=unit(c(t=0, r=0, b=-0.3, l=0), "in"))
saveShowPlot(p_mutations_isg, width=12.5, height=4.9, isfig=TRUE)